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Abstract 


Monsoon is a parallel processing dataflow computer that will require a high bandwidth 
interconnection network. A packet switched routing chip (PaRC) is described that will 
be used as the basis of this network. PaRC is a 4 by 4 routing switch which has been 
designed and fabricated as a CMOS gate array. PaRC will receive packets via one of 
its four input ports, store the packet in an on-chip buffer, and eventually transmit the 
packet via one of its four output ports. PaRC operates at 50 MHz, and each port has 
a bandwidth of 800 Mbits per second. Each input port operates asynchronously and 
has enough buffering to store four packets. The buffering and scheduling algorithms 
used in PaRC were designed to provide high utilization of the available bandwidth, 
while providing low latency for non-blocked packets. In addition, PaRC provides 
a mechanism whereby a processor can quickly receive an acknowledgment when a 
message it sent has been received. Although the design of PaRC has been driven by 
the needs of Monsoon, PaRC has been designed to be suitable for a wide variety of 
communication networks. 
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Chapter 1 


Introduction 


Multiprocessing is becoming prevalent as a way to increase the price-performance 
ratio of high performance computers. An important aspect of a multiprocessor is 
how its processors are interconnected. Some parallel machines simply connect their 
processors together by a shared bus. This is a very simple scheme that provides for 
quick message transfers and will often allow existing uniprocessors to be used with 
few modifications. But this method cannot provide enough bandwidth for more than 
a few processors; instead, interconnection networks are often used. These networks 
use switching elements to send messages between processors. Messages are sent from 
the originating processor, through one or more switching elements, and then to the 
destination processor. Since communication must now go through the switches, the 
latency of inter-processor communication has been increased. But the bandwidth is 


much greater since many messages can be sent simultaneously. 


There are many types of switching elements and many types of networks that can 
be built out of those elements. The type of network used in a machine has a great 
impact on how the machine performs, and on what types of algorithms the machine 
can run well. PaRC is a routing switch designed to form the network for the Monsoon 
dataflow machine [10][9], but it has been kept general enough to be used in other types 


of networks as well. 


Monsoon is a parallel processing dataflow computer currently being developed. A 


dataflow machine is one which executes using a data-driven model of computation. 


Programs on a dataflow machine consist of dataflow graphs. Each node of the graph 
represents an operation to be performed; each arc in the graph signifies that the 
data produced by the node at the head of the arc is used by the node at the tail. 
The model of computation is called data-driven because an operation is free to be 
executed whenever it has all of its data. This is very different from a traditional 
von Neumann machine in which the operation to be executed next is selected based 
on which operation has just been completed. Since a data-driven model imposes 
fewer constraints on the ordering of operations, it is a promising model for parallel 


machines. 


A dataflow processor operates by consuming and producing packets of data known 
as tokens. A token represents a piece of data flowing along an arc of a dataflow graph. 
A token is simply a piece of data connected to a tag. When a processor consumes a 
token it uses the tag portion of the token to determine what operation to perform on 
the data portion. After performing the operation, it uses the results to produce zero 
or more new tokens. In a single node dataflow machine, a processor will eventually 
consume all the tokens that it produces. In a multi-node dataflow machine such as 
Monsoon, one node may send a token to another, causing the second node to con- 
currently perform part of the computation. In this way, many nodes may cooperate 
on solving a single problem. The language Id [8] is a parallel language designed with 
dataflow processors in mind. From this language a compiler can directly generate 
dataflow graphs. These graphs allow the executable code to retain the parallelism 
that was inherent in the original algorithm. This allows a dataflow machine to exploit 
many opportunities for parallelism that conventional machines do not. To support 
this parallelism, a high bandwidth network is needed to interconnect the processors. 
The rest of this paper will describe the design and implementation of PaRC, a packet 


switched routing chip which will be used as the basis for this network. 


1.1 Monsoon Requirements 


Although PaRC was designed to be flexible enough to be used in many types of 
machines, its main goal was to provide a network for Monsoon. Since PaRC was 
optimized to work with Monsoon, we will first look at the requirements Monsoon 


places on any proposed network. 


The first requirement is that the network provide high bandwidth to and from each 
node. Nodes will be either a dataflow processor or an [-Structure memory board [13]. 
It is noted in [9] and [1] that for compiled scientific code, 30-40% of the instructions 
executed by a processor will be either a store or a fetch. Most, if not all, of these will 
require sending the store or fetch through the network. [9] goes on to say that after 
including other causes for network traffic (e.g., argument passing), on average half 
of the instructions executed will generate a token for the network. Since Monsoon is 


intended to be a high performance processor, a high bandwidth network is a necessity. 


Also, the latency of the network should be kept to a minimum. Most algorithms 
go through periods where the potential parallelism is large, and periods where it is 
small. During periods of high parallelism, most processors will have plenty of tokens 
to process. So even if they have several pending memory requests, they will have 
other operations to execute while the requests are being processed. The opposite 
may be true during periods of low parallelism. A processor may send off a memory 
request and have very little to do until the request is answered. Since the time spent 
waiting for a reply may directly increase the execution time of the program, we want 
the network latency to be as short as possible. The latency of a network is a function 
of many factors, such as the configuration of the network, the latency of the network 
switch, and the amount of traffic in the network. Since during low parallelism traffic 
may be light, and since we are especially concerned with latency during periods of 
low parallelism, we want to pay particular attention to minimizing the latency during 
light traffic. In particular, we want to minimize the latency of non-blocked packets. 
By “latency of a non-blocked packet,” we mean the delay from when the first word 


of a packet is received to when it is transmitted, assuming that the packet does not 
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have to wait due to other packets using the output it needs. 


Monsoon will use its network solely for passing around tokens. Given that tokens 
are of a fixed sized, the messages in the network will all be of the same size as well. 
Additionally, a Monsoon processor will sometimes need to know when a packet it sent 
has been received. The network should provide some means to acknowledge that a 
packet has arrived at its destination. In PaRC this is supported by allowing special 
packets called “Circuit Switched Packets.” Also, the number of nodes in Monsoon 
may vary from just a few nodes to many hundreds. The network must be able to 


support all sized machines. 


Lastly, let us emphasize that even though PaRC is being designed to support 
Monsoon, this is not its only goal. PaRC will be flexible enough to support many 
different types of networks. 


1.2 Network Overview 


Before designing PaRC, we first had to determine what type of network we wanted 
to build. Based on the above requirements, a number decisions about the network 


can be made. 


The first is to make the network a packet switched network. A packet switched 
network (also called store-and-forward) is one in which messages are sent as a unit 
from point to point in the network. Switches have several inputs which are used to 
receive messages (packets), and several outputs which are used for sending packets. 
When a packet is sent to a switch, that switch stores the packet in a buffer. If the 
output that the packet needs to go to is available (i.e., the output is not currently 
sending some other packet), the packet will be sent out through that output. If the 
output is unavailable, then the packet is blocked and, it will stay in the buffer until it 
can be sent out. A switch will accept packets as long as it has room to buffer them, 


regardless of whether or not the packets outputs are available. 


An alternative to packet switching is circuit switching. In a typical circuit switched 


ll 


network, messages are transmitted by making a complete connection through the 
network from the sender to the receiver. The message is sent along this path, passing 
right through the switching nodes.’ When making the connection, the header of 
a message moves from switch to switch, establishing a connection from the sender 
to the header’s current location. If the header reaches its destination then there is 
a complete path from sender to receiver, so the message will be successfully sent. 
However, when the header enters a switch, the output it wants to use may already 
be in use. When this occurs the connection is not made, and the sender will have to 


make another attempt to send the message. 


The main reason for choosing packet switching over circuit switching is the high 
throughput that is needed. A “link” is a connection which sends data to and from 
a switch. Depending on the machine, these links often make connections between 
boards or even between racks. Links are often the most expensive components of a 
network. Given a limited number of links, we need to make the best possible use of 
their bandwidth. A packet switched network will make better use of its bandwidth 
than would a similar circuit switched network. If a message can currently be sent 
only part way to its destination, a packet switched network will send the message as 
far as it can. A circuit switched network would not, thus wasting bandwidth that the 


packet switched network uses. 


A disadvantage of packet switched networks is that they often have a longer latency 
than circuit switched networks. At each stage of a network, a packet may have to wait 
while other packets use the link it needs. Additionally, even if the packet does not 
have to wait for other packets, the switch may wait until the packet is fully received 
before it sends the packet to the next stage. This cause of latency can be lessened by 
allowing the switch to send out a packet before it has fully arrived. This is known as 
streaming. When this is done, and a packet’s path is clear, the latency for a packet 


switched network will be comparable to that of a circuit switched network. 


Another important characteristic of a network is its topology. The topology 


'Circuit switching is often compared to making a phone call, whereas packet switching is com- 
pared to sending a letter through the mail. 
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of a network describes how the elements of a network are connected. An n-cube 
topology[11] (also called butterfly) was chosen for Monsoon. Figure 1.1 shows a net- 
work made of 2x2 butterfly switches. 


Stage 1 Stage 2 Stage 3 Stage 4 


x 
ou 
e 


nHnHOOZ 


Figure 1.1: A Butterfly Network using 2x2 Switches. 


An important feature of this topology is that any size network can be made out 
of the same switches. For networks such as the hypercube (which is used in the 
Connection Machine [15] and the Cosmic Cube [12]), the number of connections to a 
switching node is a function of the number of nodes in a network. Since the number 
of nodes in Monsoon will vary greatly, it is useful to have a switching node whose 


complexity will not increase as the size of the machine increases. 


Another benefit of the butterfly topology is that it allows a message to go between 
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any two nodes while passing through only a small number of switches. This is impor- 
tant since each node that a message passes through adds to its latency. In a butterfly 
network, a message will need to pass through only O(log NV) switches, where N is the 
number of nodes being connected. This is also true of the hypercube, but is not true 
of networks such as a mesh. A mesh uses fixed sized switches, as does a butterfly, 
but a message may pass through O(/WN) or O(WN) switching nodes to get to its 


destination. 


In the mesh and hypercube networks mentioned above, each switching element 
was associated with a node. These are called direct networks. The Monsoon network 
will be an indirect network. In this network only the switches in the first and last 
stages will be connected to nodes. All other switches will be connected only to other 
switches. An advantage of a direct network is that each node has several nodes which 
are very close to it (i.e., significantly closer than the average node). This can be 
taken advantage of when writing programs. By trying to get nodes to communicate 
primarily with nodes they are close to, a programmer may be able to reduce the 
average distance messages need to travel, and thereby speed up the program. In 
Monsoon there is no sense of locality between processors. In other words, when one 
node needs to send a message, it is equally likely (or nearly so) that that message 


will go to any of the other processors. 


A large Monsoon system will be spread out over many boards in many different 
cages. Providing a synchronous clock over such a large machine would be a very 
difficult task.* Since the nodes of this system are independent and may operate 
asynchronously, we did not want to provide a synchronous clock just for the network. 
For this reason each PaRC will operate asynchronously. This means that packets 
coming into PaRC from different places will all be transmitted from separate clock 
domains. The cost of synchronizing these incoming packets is that the latency of a 
packet is slightly increased. Since the networks will be spread out over many boards, 
some of the connections to and from PaRC may be quite long. This will affect the 


116] shows some of the difficulties of providing a synchronous clock, and shows one way of 
overcoming them. 
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flow control mechanism. PaRC should work well on all sized links, including both 
short links between PaRCs on the same board, and long links going between racks. 
Lastly, the network should also preserve the ordering of packets. If two packets have 
the same sender and the same receiver, the packets should arrive in the same order 


in which they were sent. 


1.3. Road Map 


The work in this report is a continuation of the work begun in my bachelor’s thesis[6]. 
The ideas in that thesis have been expanded upon and used in the design and fabrica- 
tion of a PaRC chip. The rest of this report describes the design and implementation 
of PaRC. The next chapter begins with an overview of PaRC and then describes 
some of the key ideas that had the greatest impact on the design and performance 
of PaRC. Chapter 3 describes some of the implementation details of each section of 
PaRC. Chapter 4 describes the process of generating the test vectors which are used 
to test newly fabricated chips. The final chapter gives some conclusions and men- 
tions some changes that future versions of PaRC could incorporate. The appendix 
contains the PaRC User’s Guide. This guide contains the information users need to 
know when including PaRC in their system. Since this guide is meant to stand on its 


own, it duplicates some of the information contained in the body of this document. 
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Chapter 2 


Design of PaRC 


2.1 Overview of PaRC 


PaRC is a 4 by 4 routing switch on a chip. It receives packets via one of its 4 input 
ports, writes them into an on-chip buffer, and eventually sends them out via one of 
its 4 output ports. This section will provide an overview of the major components of 
PaRC. Figure 2.1 shows these components in a top level diagram of PaRC. A more 


detailed description of these components can be found in Chapter 3. 


The first major component is the input port. This component is responsible for 
receiving packets, checking them for errors, and writing the packets into memory. 
The input port must also determine on which output port a packet should be sent. It 
then makes a request to that output port’s scheduler, telling the scheduler in which 
buffer the packet is being stored. The input port must also notice when its memory 
is filling up, and then notify the sender not to send any more packets. The memory 
of each input port is composed of 4 separate buffers, each of which can store exactly 


one packet. 


Each output port is made up of two components: a scheduler and a transmitter. 
The scheduler keeps track of all the packets which need to use the output port and 
chooses which packet will be transmitted next. If two or more of the waiting packets 


were received via the same input port, the scheduler must guarantee that they are 
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The major datapaths and I/Os of PaRC are shown 


Figure 2.1: Top Level View of PaRC 


ies 


transmitted in the same order in which they arrived. The transmitter is the com- 
ponent which sends packets off the chip. When the transmitter is ready to send a 
packet, it first finds out from the scheduler where the next packet is being stored. 
The transmitter then reads the packet out of its buffer and transmits it to the next 
stage of the network. If the packet is a circuit switched packet (i.e., a packet which 
requires an acknowledgment), the transmitter must also ensure that an acknowledg- 
ment is produced at the appropriate time. Circuit switched packets will be discussed 


in more detail in Section 2.4 


The control section has two main functions. One function is to keep statistics on 
the performance of the network. The other is to provide an interface by which the 
performance of PaRC can be controlled and monitored. Through this interface all of 
PaRC’s programmable features can be controlled (such as how routing is done). This 
interface can connect to a network control system which will allow the operation of 


the network to be controlled by software. 


2.2 Buffer Utilization 


A design decision that greatly affects the performance of a packet switched network is 
how much buffering each switch has and how that buffering is used. Typically, store 
and forward networks store messages coming in over the same input in one long fifo 
(first-in-first-out) buffer associated with that input. PaRC does not do this. Instead, 
it splits the memory associated with each input port into four separate buffers, each 
of which is large enough to hold exactly one packet. The principle reason for doing 


this is to maximize the chip’s throughput. 


If the memory was organized as one large fifo queue, then it would only be possible 
to read out the packet at the top of the queue. When that packet was blocked, nothing 
could be read out of that fifo, even if the other packets in the fifo were not blocked. 
This puts a severe limitation on the throughput of the network. Assuming that the 


top of each port’s fifo has a packet which is heading for a random destination, then on 
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average only 2.7 of those 4 packets would be unblocked. This means that even if the 
links entering a chip can supply a packet whenever room is available, the utilization 
rate for links leaving that chip would be only 68%. This means that one-third of the 
each link’s bandwidth would be wasted! 


This rate can be improved by looking at more than just the top packet. If the first 
packet is blocked we could look at the next packet and send that packet if it is not 
blocked. If we do this then the best case utilization rate rises from 67% to 80%. (The 
best case utilization rate is defined to be the average utilization of output ports if all 
buffers always have a packet. This is not a measure of how well we expect to do, but 
is a limit on how well we could possibly do.) If we can look at all four packets and 


send the first one that is not blocked the best case utilization rises to 90%. 


The ability to read from any buffer can be used to our advantage even more. We 
can give each buffer its own output circuitry, thus allowing each buffer to be read 
independently. It then becomes possible to read out two (or more) buffers from the 
same input port at the same time. This permits each output port to read from any 
buffer that has a packet for it, regardless of which other buffers are currently being 
read. By doing this, the best case link utilization rate for PaRC rises to 99%. Another 
advantage of adding output circuitry to each buffer is that it makes it faster to begin 
to read out a packet. Since the buffer always knows which word will be read next, 
it can begin the read of that word on the cycle before it is needed. This helps to 


minimize the latency of packets. 


Two properties of PaRC are exploited in reaching this best case utilization rate. 
First, we make use of the fact that all of the packet memory can be put on-chip. If 
packets were stored externally it would be very expensive to use this scheme because 
it would greatly increase the number of chips needed for packet buffering. It would 
require a larger number (at least one per buffer) of small memories (or fifo memories), 
as well as additional external circuitry to multiplex together the results. (Using 


multiported memories could reduce this cost somewhat.) 


Also, if the packets were not of fixed size, but their sizes varied greatly, this scheme 
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would be very inefficient. To use this scheme, we would have to make each buffer large 
enough to hold the largest allowable packet. For smaller packets, part of the buffer 
space would be wasted. In a single fifo scheme, memory space is not wasted because 
each packet takes up only as much room as it needs. So given a fixed amount of 
memory, the separate fifo scheme would buffer significantly fewer packets. Although 
if most packets were close to the maximum packet size, this loss would be relatively 


insignificant. 


These are significant improvements. Not only do they increase the throughput 
greatly, but they also decrease latency; packets which would have been stuck behind 
other packets can now be sent as soon as their output port is available. These im- 
provements are successful because they help to prevent output ports from being idle 


while there are packets waiting to use them. 


These improvements, which were first described in my bachelor’s thesis, are related 
to the recently published Virtual Channel Flow Control scheme of Dally [3]. This 
scheme is mostly concerned with networks where packets may be large enough so 
that switches will not buffer entire packets. The concept of virtual channels are used 
to achieve results similar to PaRC by restricting a packet to use only a portion of the 
buffering in an input port. This is done by dividing up the buffer space into several 
smaller separate fifos, and assigning each fifo to a different “virtual channel.” When 
a packet is sent to a switch, the sender must associate the packet with a virtual 
channel that is not currently in use, and the packet can only use the buffer space 
associated with its virtual channel. If a packet becomes blocked it will fill up its 
buffer space, and the sender will stop sending the packet until more space is available 
for it. Since a packet can only use the space associated with its virtual channel, it 
does not fill up the rest of the buffer space in that input. This allows another packet 
to be sent to that input and stored into the buffers associated with a different virtual 
channel. This packet may take a different path than the earlier packet so it may not 
get blocked. This improves both the average throughput and latency of the switch. 
PaRC’s buffering scheme is similar to a virtual channel scheme in which there are 


four virtual channels, and each virtual channel has enough buffering to store exactly 
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one packet. PaRC’s scheme is simpler since it does not deal with buffering only part 
of a packet. PaRC’s scheme also requires less bandwidth overhead since the receiver 
does not have to be told what virtual channel the data is being sent on, and since 


separate flow control information is not needed for each virtual channel. 


PaRC’s buffering strategy greatly improves the chip’s throughput and latency, but 
these improvements do come at a price. In addition to adding to the size of the 
packet buffers, this scheme makes the scheduling problem much more complex. The 
scheduler must still ensure that packets following the same path do not get out of 
order. More precisely, packets that arrive through the same port and which will go 
out the same port must be sent out in the same order in which they were received. 
This is easily done when there is a single fifo buffer, since packets can only be read 
out in exactly the same order they arrived. This is not true in PaRC; there is no way 
to tell which of two packets arrived first simply by looking at packet memory. The 


scheduling strategy will have to solve this problem. 


2.3 Scheduling Strategy 


The scheduling strategy had to be designed to take advantage of the large number 
of independent fifo buffers that each transmitter could read from. It also had to deal 
with the ordering problem outlined above. There are two types of schedulers that 
PaRC could have used: centralized and distributed. In the distributed method, each 
output port has its own scheduler; while in a centralized method there is one scheduler 
that makes the scheduling decisions for all the ports. A centralized scheduler has the 
advantage of more flexibility; it can easily deal with packets that can be sent to more 
than one output port. However, since it is more complex, it is slower and probably 
cannot schedule more than one or two packets each cycle. A distributed method was 
chosen so that ports would not have to sit idle for several cycles while waiting for the 


scheduler to get a chance to schedule them. 


In a simple distributed scheme, each packet buffer has a request line to each sched- 
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uler and holds it high whenever it has a packet going to that output port. This 
strategy would not work in PaRC because it would fail to guarantee that packets 
traveling between the same ports are kept in order. Adding a timestamp to each 
request would not work because there is no limit on the amount of time that a packet 


could have to wait in a buffer. 


The first idea on how to deal with this problem was to prioritize each of the 
packets in an input port. The packet which has been waiting the longest would have 
the highest priority. This worked as follows: When a packet arrived and all buffers 
were empty, it would be given the highest priority, 3. If another packet arrived before 
this packet left, it would be given a slightly lower priority: 2. (In general the priority 
given to an incoming packet is (¢ — 1), where i is the number of available buffers). 
Each time a packet is removed, all remaining lower priority packets would have their 
priority increased by 1. Note that if several packets were removed on the same cycle 
each of the remaining packets would have to be increased by an appropriate amount. 
On each cycle the input port would have to look at its four buffers and choose the 
highest priority packet going to each output port. The schedulers would then choose 
between the packets chosen by the input ports. This scheme would work but the logic 
for it would be very complex and, more importantly, slow; thus adding extra cycles 


to the latency of non-blocked packets. 


To simplify (and thus speed up) scheduling, a way to use the advantages of single 
fifo buffering is needed. This is done by putting a fifo queue in each scheduler; but 
instead of buffering packets, these fifos only need to buffer pointers to packets. When 
a new request arrives, a pointer to the requesting packet is added to the bottom of 
this scheduling fifo (s-fifo). Choosing the next packet to be transmitted is as simple 
as reading the top of the s-fifo. This scheme makes it possible to start reading out a 


packet on the cycle immediately after its request has been made. 


There is one inefficiency we must tolerate with this scheme. To guarantee that 
none of these s-fifos fill up, each scheduler’s s-fifo must be large enough to store a 


request from every input fifo. So there is four times as much s-fifo space as could ever 
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be in use at one time. This is bearable because each s-fifo location is small (5 bits: 
4 to point to the packet, and one to indicate whether or not the packet is a circuit 
switched packet). The total size of all the s-fifos is equivalent to approximately two 
packets worth of buffering. 


PaRC’s buffering and scheduling schemes, both of which were described in my bach- 
elor’s thesis, are related to the later work of Frazier and Tamir. Their dynamically- 
allocated-multi-queue (DAMQ) buffer [4] is a way to configure and control an input 
port’s memory in order to reduce output port contention. In a DAMQ buffer the 
memory is split into fixed sized blocks. Packets are stored using one or more full 
blocks. Several logical queues of these blocks are maintained in the DAMQ buffer, 
and the DAMQ is able to move blocks between queues. One of these queues is a 
list of free blocks; the rest are queues of blocks which contain packets headed for the 
same output port. At a given time the head block of any one queue may be read out 
of memory. As with PaRC this increases the performance of a machine by allowing 
packets to leave an input port in a different order than they arrived. However the 
DAMQ buffer is not as efficient as PaRC since only one packet can be read out of an 


input port’s memory at a time. 


By maintaining queues in each input port, a switch using DAMQ buffers solves the 
problem of keeping packets in fifo order. In a DAMQ buffer, each queue has its own 
head and tail pointer; each block of memory also has its own pointer which is used 
to establish its position in a queue. The state of these pointers describes the current 
state of the queues. The DAMQ moves blocks between queues by manipulating these 
pointers. However moving a block between queues (e.g., from the free list to an output 
port’s queue) is a complex task that takes three cycles. (It actually takes 6 cycles 
since the control of these pointers is shared between logic for incoming and outgoing 
data.) PaRC does a similar job by using a scheduler fifo. Since PaRC uses an actual 
fifo its control is much simpler. In each cycle an item can be added to and removed 
from the queue. The price for this speed and simplicity is that extra memory space 
is needed since at most one-fourth of s-fifo memory is used at any one time. Also, 


by placing the queue in the output port rather than the input port, PaRC provides 
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first-come-first-served scheduling for packets which arrive via different input ports. 


2.4 Circuit Switched Packets 


2.4.1 Why They Are Needed 


Circuit switch packets were added to PaRC so that a processor can send a packet 
and be sure that it has been received before continuing. Although most of the time 
processors do not care when their messages are received, there are a few cases where 
it does matter. One example is when memory is to be deallocated. An object in 
memory can be deallocated only if we know it will never be used again. In particular, 
when it can be proven that an object is only used within a certain block(s) of code, 


we would like to be able to deallocate that object after running that code. 


To safely deallocate an object we must be sure that all reads and writes of that 
object have been completed. Since code will only complete after all the reads have 
returned, we know that once the code has run there are no outstanding reads. How- 
ever, this is not true with writes. It is possible for a location to be written but never 
read; so just because the code has completed does not mean that all the writes have 


completed. 


If all the writes, as well as the deallocation, were issued by the same processor, this 
would not be a problem. The deallocation would be issued after the writes and the 
network would guarantee that they arrived in the same order. But this is not always 
the case. Node A may send the writes to node B and then tell node C that the code 
block has completed. Node C may then send the deallocation to B. It is possible that 
it may take longer for a write to go from A to B than for the deallocation information 
to go from A to C to B, this would cause the write to arrive after the object has been 


deallocated. 


To prevent this from happening node A must not tell C that it is done until all of 


its writes have completed. One way for A to know that the write is complete is to do 
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a read of the location. But this is slow, as it would require at least the full network 
transit time of two packets. This would also increase network traffic, since it requires 
sending two extra packets. Another way is to do a write which requires the memory 
to send back a packet to acknowledge the write. However, this would still be slow and 
would still increase network traffic (although only by one extra packet). Instead we use 
packets which can generate an acknowledgment without sending back an additional 
packet. These are called circuit switched packets. In the above example, processor 
A can send the write as a circuit switched packet, so it will get an acknowledgment 
when the packet arrives. Once A knows that its write has arrived at B, it can inform 
C that it has completed. If C then sends a deallocation message to B, that message 


could only arrive after the write. 


2.4.2 How They Work 


The original idea for circuit switched packets was that as they traveled through the 
network they would hold on to the links they crossed, thus creating a connection 
from the sender to the receiver. This connection would be similar to those in circuit 
switched networks, hence the name “circuit switched” packets. When the packet 
reached its destination, an acknowledgment signal would be sent back along this 
connection, thereby informing the sender that its packet had arrived. This signal 
would also break each connection after crossing it. This scheme had the drawback 
that links could be blocked for long periods of time during circuit switched packets. 
During most of this time the link would be idle, since once the packet was sent, 


nothing else could be sent until the connection was broken. 


This scheme was greatly improved by allowing normal packets to be sent across 
the link while waiting for the acknowledgment signal. All that must be done is to 
keep track of where the packet came from. So now a circuit switched packet is sent 
the same way as any other packet, the only difference is that when the packet leaves 
a PaRC chip, the transmitter keeps a “back pointer” to the port which the packet 


came from. These back pointers lead from the packet’s current location back to the 
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sender of the packet. When the packet reaches its destination an acknowledgment 
signal is generated which uses these pointers to find its way back to the sender and 
inform it that its packet has arrived. This does not eliminate all blockage due to 
circuit switched packets. If a circuit switched packet needs to be sent out through 
a port that is already waiting for an acknowledgment, the port will be blocked until 
the acknowledgment for the first circuit switched packet arrives. If the second circuit 
switched packet were sent, then when an acknowledgment signal arrived, there would 


be no way of knowing which packet the acknowledgment was for. 


Another way to reduce the disruption of circuit switched packets is by minimizing 
the time spent waiting for the acknowledgment signal. This is done in two ways. The 
first is minimizing the time it takes for an acknowledgment to be routed through a 
PaRC chip. We do not wait for the acknowledgment to be latched in and synchronized 
before we pass it on. Instead as soon as a PaRC chip receives an acknowledgment, it 


begins to send it on. 


The second way to minimize the time spent waiting for acknowledgments is by 
generating each acknowledgment as soon as possible. One way to acknowledge circuit 
switched packets is to have the receiver generate an acknowledgment once it receives 
the packet, but it is not really necessary to wait this long. It could be done sooner 
by taking advantage of the property that once a packet is being sent to the receiver, 
no new packet could possibly get there before it. This means we can have the PaRC 
chips in the last stage of the network generate an acknowledgment as soon as they 


begin sending a circuit switched packet to the processor/memory node. 


The acknowledgment can be generated even sooner if we take advantage of the 
fact that each node receives packets from only one PaRC chip. Once a packet has 
been received and its request placed on the appropriate s-fifo, there is no way that 
any new packet can get ahead of it on that s-fifo. Since that is the only way to get to 
the receiving node, no packet not already on the s-fifo can get to the receiver before 
this packet. This means we can generate the acknowledgment as soon as a packet’s 


'Rach link has a dedicated wire that is used for transmitting this acknowledgment signal. This 
wire transmits data in the opposite direction as the rest of the link. 
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request is stored. Taking this one step further, we can have the next to last PaRC 
chip generate the acknowledgment as soon as it sends out the circuit switched packet. 
This can be done since once the packet has started to be sent out, it is guaranteed to 
be put on the next s-fifo within a short, fixed period of time. When this is done, the 
final stage of PaRC chips should treat all packets as normal packets. 


2.5 PaRC Interface 


When sending packets, the output ports of PaRC will have to send those packets ac- 
cording to a specified protocol. Of course, the input ports will have to receive packets 
according to this same protocol. This protocol will define the interface between PaRC 
chips. This interface is important because it will influence the number of wires needed 
in a link (and hence its cost) and the amount of usable bandwidth we get out of those 


wires. 


Each link will consist of 16 bits of data, accompanied by a clock. Each link 
needs its own clock because, as mentioned earlier, each part of the network operates 
asynchronously. It is assumed that the rising edge of the clock occurs while the data 
is stable. Since PaRC can operate at 50MHz, this gives us a raw bandwidth of 800 


Mbits per second per port. 


The data that the processor needs to send are messages which are 144 bits long. 
Since our datapath is 16 bits wide, this data takes up 9 words of our packet. In 
addition, a packet also contains 2 extra words which are used to check for errors.” 
Finally, we need information on how to route the packet. This information will be 
placed in the first word, the header, since we want to be able to start sending out a 
packet soon after it arrives. This gives us a 12 word packet whose format is shown in 


the following table. 


2Optionally, one or both of these words can be used for data instead of error checking. 
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USE 
0 [Meaderwort] = [6 ‘| Data word 
pS [Data word 7 


- 3 [Data word 5 
CRC word 0 
CRC word | 


PaRC Packet Format 


The above information describes what packets look like, but it doesn’t say how an 
input port can determine when a packet begins. One possible way is to have an extra 
wire which provides a frame bit. This bit would go high only during the first word 
of a packet. This would work but would be a very inefficient use of the wire. Instead 
PaRC uses one of the bits in the header (the uppermost bit) as a Start-Of-Packet 
(SOP) bit. When an input port is not receiving a packet, it looks at the value of this 
bit. When the SOP bit is 0 then no packet is being started. When it becomes a 1 
then a packet is being started, and this word is its header. The next 11 words must 
then be the rest of the packet. During these words the bit position used as a SOP 
bit is used as a normal data bit. In this way only one bit per packet is “wasted” as 
overhead. If we had used a separate wire for a frame bit, we would have wasted 12 


bits of bandwidth per packet. 


Once the 12 words of a packet have been received, the input port begins looking 
at the SOP bit again to see when the next packet begins. There is no need for an idle 
word between packets; an output port may begin to send a new packet immediately 
after the previous one has completed. This also helps to make maximum use of our 


bandwidth. 


In addition to the SOP bit, the header of a packet also contains up to 15 bits of 
information which will be used to determine how the packet will be routed. The use 
of these bits will be described in the section on routing (Section 2.8). Optionally, one 
of these bits can be used to indicate whether or not this packet is a circuit switched 


packet. The format of the header is: 
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Header Format 
ie 


CSP/ROUTE-DATA.14 | ROUTE-DATA [13.0] 


When an output port is not transmitting a packet, it will transmit one of 2 specified 


idle patterns. (Of course both of these patterns have a 0 in the SOP bit.) The 
output port will either alternate between these patterns (to keep the output data bits 
changing) or can always use the same pattern. (This will cut down on the power 
usage.) Specified idle patterns are used to allow an input port to detect virtually all 


link transmission errors. How this is done will be described in the following section. 


2.6 Error Detection 


The PaRC interface is designed to detect virtually all link transmission errors. If an 
error were to go undetected, Monsoon could produce an incorrect result. We want to 
minimize the possibility of this happening. It is also important that we detect where 
the error occurred, so that we can take steps to prevent it from reoccurring. If we 
only did end-to-end checks (i.e., checking the validity of packets as they enter and 


leave the network) we would not be able to do this. 


2.6.1 Types of Errors 


There are three types of errors that an input port can detect on a link. These are 


room-error, CRC-error, and idle-error. 


Each input port only has enough buffers to store 4 packets. A room error occurs 
when a packet arrives and there is no room for it.? When this occurs the input port 


will receive the entire packet and simply discard it. 


Error checking is done on packets by use of a Cyclic Redundancy Code (CRC). As 
each data word of a packet is received, it is accumulated into a checksum. PaRC uses 


’The flow control mechanism should prevent this from occurring. See section Section 2.7. 
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a 32 bit checksum and accumulates values using the CRC32 polynomial. The header 
and the 9 data words are accumulated to produce a checksum; this checksum is the 
32 bits that should appear in the final two words of the packet. A CRC error occurs 
when the checksum computed by the input port does not match the CRC at the end 
of the packet. Using this code there is less than a 1 in 10° chance of an incorrect 
packet being mistaken for a correct one. In addition there are certain types of errors 


which will always be detected. More details on the CRC are given in Section 3.3. 


As mentioned earlier, PaRC’s output ports always output one of two specified 
patterns, called idle patterns, when they are not sending a packet. When an input 
port is not receiving a packet it checks to see that each word it receives is one of the 
idle patterns. An idle error occurs when a word is not part of a packet and is not one 
of the idle patterns. By checking for CRC and idle errors, the input port is able to 


detect virtually all link transmission errors. 


Here are the possible transmission errors, and how they will be detected: 


e Error occurs inside of packet (i.e., any bit except SOP bit). This will cause the 


checksum to be incorrect, thereby creating a CRC error. 


e Error turns SOP bit from 1 to 0. The input port will expect the word to be 
an idle pattern, since a packet is not starting. Since it probably will not be, an 
idle error will occur. In addition, the remaining words in the packet will either 


cause idle errors, or begin a packet which will have a bad CRC. 


e Error occurs in SOP bit of idle pattern (i.e., turns SOP bit from 0 to 1.) This 
will create a packet which will have a bad CRC, causing a CRC error. 


e Error occurs in other bits of idle pattern. This will give an idle error (but will 


not cause any loss of data). 


This makes it extremely unlikely that an undetected error will occur. And when 
errors do occur, it will be easy to pinpoint their location since we can tell where the 


error first appeared. 
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Since the network will be spread out over many boards, the links between PaRC 
chips will need to go from board to board. To reduce the number of wires that must 
be sent between boards, and to reduce the chance of errors on these connections, the 
Data Link Chip [2] was designed. This chip will take the 16 signals from from an 
output port, multiplex them into 4 values, and transmit them differentially at 4 times 
the speed of PaRC. A DLC will also be at the receiving end of the link to receive 
these signals. It will demultiplex these values and send them to an input port, along 
with a clock. The transmitting DLC can also detect errors and report them to the 


PaRC chip. 


When the network is initialized, each transmitting DLC will synchronize itself to 
the output clock being produced by the port it is connected to. This will let it know 
when it is safe to sample the data coming from the output port. This output clock 
is generated from PaRC’s main clock (ICLK). Since ICLK is generated by the DLC 
from the DLC’s clock, the DLC should stay synchronized to the data out clock. If the 
data-out clock were to drift from its original position in relation to the DLC clock, it 
is possible this synchronization could be lost. The main reason that the clock might 
drift is the changing speed of the PaRC circuitry as it heats up. Synchronization 
should not be lost even when PaRC moves across its entire allowable temperature 
range. Nevertheless, the DLC continually checks to ensure that synchronization has 


not been lost. 


When a DLC detects a possible problem with its synchronization, it can report 
one of two errors: LINK-ERR and LINK-GRAY. The difference between these errors 
is that when a link-gray occurs no data has been lost. When a link-err occurs it is 
likely that bad data has been transmitted. (Note that if bad data was transmitted, 
it should be detected by the input port that is receiving data from that link.) When 
one of these errors is detected the link should be taken out of use, resynchronized, 


and put back in use. All of this can be done via software. 
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2.6.2 How Errors are Dealt With 
When an error is detected we would like to 


e Prevent any data from being lost, if possible; else minimize the amount of data 


lost. 


e Determine where the error occurred. 


To prevent data from being lost, or minimize the amount that is lost, when we 
detect an error we may want to stop using the connection on which the error occurred. 
To help do this, when PaRC detects an error, it will immediately bring its ERROR 
output high. The network control system can see this and stop the ports which are 
using that connection. This prevents any more data from being lost. Unfortunately 


it may take many PaRC cycles before this can be done. 


Because of this PaRC can be programmed to go into “passive mode” when it 
detects an error. When in passive mode, PaRC will not begin to transmit any more 
packets. In addition, when in passive mode each of the input ports will tell the port 
which sends packets to it to stop sending packets. This should stop packets from 
being sent to or from the PaRC chip, thus limiting the amount of data that could be 
lost.4 In particular, if the only errors recorded were from the link, then no data at all 
will have been lost. Software can tell PaRC to resynchronize the links and have PaRC 
exit from passive mode; the machine will then continue normally. Even if errors were 
detected by the input port it is possible that no data has been lost if the errors were 


caused by corrupted bits in idle patterns. 


To help determine where the error occurred, PaRC keeps track of what errors it 
has detected. For the errors detected by input ports, PaRC keeps track of how many 
(0, 1, 2, or more than 2) of each error type was detected by each port. For each type 
of link error, PaRC keeps track of whether or not that error has occurred on each of 
the output ports. All of these values can be read via PaRC’s control port. 


4 Although if the flow control line is giving erroneous values, packets may continue to be sent to 
the chip. 
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2.7 Flow Control 


Flow control is needed between each transmitter and the input port that it connects 
to so that an input port does not receive more packets than it has room to buffer. 
The mechanism chosen for this is important because a sub-optimal mechanism could 
needlessly delay packets, which would both add latency, and reduce the throughput 
of the links. This problem is not trivial due to the delay in sending flow control 
information from the receiver to the sender. We may have to turn off the sender 
before the receiver is full so that the data sent after it is told to stop does not 


overshoot the amount of room available to store data. 


The scheme used in PaRC has a simple basis. The input port produces an asyn- 
chronous signal called WAIT, which is sent to the output port. When this signal is 
low the output port can send packets. When it is high the output port should stop 
sending packets. (Of course, any packet it has already started to send should be com- 
pleted.) This mechanism has the advantage of needing only one signal sent between 
transmitter and receiver. Also the logic to implement it is fairly simple. To produce 
WAIT the input port looks at all four of its packet buffers, and asserts WAIT only if 


none of them are currently ready to accept a new packet. 


There is one major problem with this strategy. If it takes too long for the trans- 
mitter to receive the WAIT signal, it will begin sending another packet even though 
there is no more room. The length of the delay between PaRC chips is critical because 
this delay counts twice in determining if the WAIT signal will be received in time: 
A transmitter begins reading out a packet and then sends it out across a link. An 
input port receives the first word and if there is no room for any packets after this one 
it will assert WAIT. This signal is then sent back across the link to the transmitter 
and synchronized to its clock. This synchronized signal must go high soon enough to 


prevent the transmitter from sending a new packet. 


Given the particulars of PaRC’s implementation the link transit time of the data 
plus the link transit time of the WAIT signal must be less than 8 cycles (= 160ns) for 


this to work correctly. Considering the delay of the DLC and the wiring we expect 
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to use, this will allow links over 30 feet long. Although this should be long enough 


for the links we expect to use, we still want PaRC to be able to support longer links. 


An important feature of this scheme is that its performance was optimal; no other 
strategy could do better.®? There are other schemes which could be used, which also 
give good performance, and will work on longer links. One possibility is to have the 
receiver send back a value giving the number of packet buffers currently available. 
However, this scheme would be very costly as it requires sending back several lines 
for data, and possibly a clock as well. A variation of this is to send back a pulse each 
time a packet buffer is freed. This allows the transmitter to determine exactly how 
many buffers the receiver has available. Since the sender would then know exactly 
how many buffers the receiver has available, this would also give optimal use of the 
packet buffers. This requires only a single signal but it has several problems, such 
as what happens if a transmitter misses one of the pulses. Also the logic becomes 
complex since, for example, several buffers can become available at once, but the 


input port can only send a pulse every other cycle. 


To allow longer links, the input port can be programmed to generate a different 
wait signal called LWAIT (for Long WAIT). LWAIT is similar to WAIT except that 
it is asserted whenever there are only 0 or | available buffers. As soon as three buffers 
are full, the input port says to stop sending. If the link is long enough the transmitter 
will send out one more packet before it is stopped by the LWAIT signal. This is fine 
since LWAIT was raised while there was still one buffer available. Using LWAIT the 
allowable round trip transit time is increased by almost 12 cycles. This allows links 


much longer than will probably ever be needed. 


The problem with using LWAIT is that its performance is not optimal since it 


sometimes wastes one of the buffers. This may occur in two places: 


e When the number of available buffers drops from 2 to 1. LWAIT may go high 
in time to stop the next packet, thus leaving one of the buffers unused. 


>It is optimal given the constraints of the fixed delay between sender and receiver and given that 
we do not know in advance when buffers will become available. 


34 


e When the number of available buffers rises from 0 to 1. There is now an available 


buffer, but no packet will be sent to it until a second buffer becomes available. 


The first concern is not as much of a problem as it may at first seem. In light traffic, 
using LWAIT will not hurt much because the input buffers will rarely fill up enough 
to force LWAIT high. In heavy traffic, when a third buffer is put in use, LWAIT will 
be asserted. But since traffic is heavy other packets will have been waiting to use the 
link and by the time the rising LWAIT signal gets to the transmitter, the transmitter 
will have already begun to send another packet. So the fourth buffer will be filled 


anyway. 


This will be true only on longer links, since on short links the LWAIT signal will 
be received before the transmitter begins to send the packet. By using either WAIT 
or LWAIT based on the length of the link we can minimize this problem. On links 
where the rising WAIT signal is guaranteed to be received in time we can use the 
normal WAIT signal. On links where the rising WAIT signal is guaranteed not to 
stop the next packet, LWAIT will be used and if there is a packet waiting it will be 
sent into the fourth buffer. It is only on the remaining links, those where the rising 
WAIT signal may or may not arrive in time, that we lose. (The uncertainty is due 
to the synchronization of the WAIT signal, and to variations in component delays.) 
Since the WAIT signal may not arrive in time we must generate the LWAIT signal. 
But if the signal does arrive soon enough, it may prevent a waiting packet from being 


sent into the fourth buffer. 


To minimize the performance degradations of the above two concerns, PaRC can 
generate a modified LWAIT signal which differs from the normal LWAIT in two ways. 
The first modification is to delay the rising edge of LWAIT by two cycles. On links 
where the LWAIT signal may or may not arrive in time, this should increase the 
delay enough so that all four buffers can be filled. De-assertions of LWAIT are not 
delayed because when LWAIT goes low we want new packets to be sent as soon as 
possible. This has the side effect of reducing the allowable round trip transit time by 


two cycles. But even with this change we can still use links much longer than we will 
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need. 


The second concern above is that when all four buffers are full and one of the 
buffers becomes free, LWAIT stays high. It will not be de-asserted again until there 
are two available buffer spaces. A buffer is now being wasted since there is an empty 
buffer and nothing is allowed to be sent to it. There is a way to improve upon this 
as well. When the number of free buffers increases from 0 to 1, the transmitter is 
allowed to send one (and only one) more packet. This is done by de-asserting LWAIT 
for two cycles. When the transmitter notices that LWAIT is low it will begin to send 
a packet if it has any. It can only send one because LWAIT will be high again by 
the time the transmitter finishes sending the packet. If the transmitter does not have 
any waiting packets during the time when LWAIT is low then no packets will be sent, 


even if one arrives shortly thereafter. 


For both of these modifications there is only a small window of time when a packet 
will be able to be sent into an otherwise wasted buffer. If no packet needs to be 
sent during that time, then the buffer will remain empty. So it is during moderate 
and heavy traffic that these modifications will be most effective; and that is exactly 
when they are most needed. In fact, if there are always waiting packets, then these 
modifications cause the LWAIT flow control strategy to give us optimal performance. 
It is only when the modifications provide a window in which a packet could be sent, 


and one is not, that this strategy is sub-optimal. 


2.8 Routing 


When a packet enters PaRC, the input port must decide which of the 4 output ports 
the packet should be sent out of. This is where the 14 (or 15) ROUTE-DATA bits 
from the header are used. Each PaRC chip can be made to look at two different bits 
of ROUTE-DATA. This allows networks of size 2'* = 16K to be supported; 32K if 


circuit switched packets are not being used. 


The standard way that routing is done is to tell the PaRC chip which two bits of 
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the routing data to look at. Packets going through that chip will then go to the port 
indicated by the two selected bits. With this routing mechanism, butterfly routing 
can easily be done by using the destination address as the routing data. Each stage 
of the network can then route based on consecutive bits of the address. Figure 2.2 
shows one way this could be done in a butterfly network made up of 2x2 switches. 
Since the destination address is 011, the first stage of the network sends the packet 
out port 0, the second stage port 1, and the third stage also port 1. This is true 
regardless of where the packet enters the network. Routing can be done the same 


way with 4x4 switches except that each stage uses two bits for routing instead of one. 


111 
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000 


nodes Stage 1 Stage 2 Stage 3 nodes 


This shows the route a packet would take to 
get to node 011 (from node 101). 


Figure 2.2: Routing in a Butterfly Network 


PaRC is also able to do routing based on local congestion. In this mode only 
one bit of the routing data is looked at. This bit is used to determine if the packet 
should got to a top port (ports 3 and 2) or to a bottom port (ports 1 and 0). If it 
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is determined that the packet should go to a top (bottom) port, PaRC will choose 
the top (bottom) port with the shortest queue of waiting packets. This is called 
Up/Down routing. Of course, this will be useful only in networks in which there are 
several paths that a packet can take to its destination, and in systems where a packet 


may go to any of a group of destinations. 


PaRC can also act as two parallel 2x2 switches. Packets which enter through one 
of the upper two input ports are transmitted out through one of the upper two output 
ports. Which of the two output ports is used is determined by looking at one bit of 
the packet’s routing data. Similarly, packets entering through one of the two lower 
ports are routed to a lower port. A variation of this routes based on congestion. 
Packets entering through an upper (lower) port are sent to the upper (lower) port 


with the shortest output queue. 


There are some minor modifications to the network Monsoon will use that would 
be useful for some machines. Normally the only way one node can talk to another is 
to send a message that travels the length of the entire network. In many machines 
it is beneficial for each processor to have a portion of memory which it is closer 
to. Objects which will be frequently accessed by a processor will be placed in the 
memory close to that processor. Other objects can be distributed as before. Part B 
of Figure 2.3 shows one way this can be done. In this configuration, a processor and 
memory unit are bundled together with a PaRC chip. Each has its own port to and 
from the network, just as in the standard configuration. But since the PaRC chip has 
been added, the processor and memory can now communicate through a short path. 
This will reduce the latency of messages sent between the processor and the memory, 
and reduce the volume of traffic that is sent into the main network. By reducing the 
number of messages sent into the main network, we allow the network to support 


higher performance nodes. 


In some machines it may be desirable to closely bundle three components. For 
example, the Epsilon-2 dataflow machine [5], being designed at Sandia National Labs, 


may bundle together a processor, a memory, and an IO unit. A way to produce such 
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B) Processor and Memory bundled together. 
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C) Processor, Memory, and IO bundled together. 
This reduces the bandwith to each node. 


Figure 2.3: Several Ways to hook nodes into a network. 
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a configuration is shown in part C of Figure 2.3. This allows the three components to 
quickly communicate with one another. Since there is only one port to the network, 
it must be shared by all three components. This decreases the bandwidth that each 
has to the rest of the machine. So to use such a configuration it is necessary that a 


large portion of the communication is to one of the local nodes. 


N N N N N N N N N N N N N N N N 


This shows the nodes, switches and connections in a fat tree. 


The closer a connection is to the top of a tree, the higher its bandwidth. 


Figure 2.4: A typical fat tree 


PaRC would be useful in many networks besides butterfly based networks. One 
example is the fat tree network [7]. A fat tree is a network with a topology of a binary 
tree. The leaves of the tree are the nodes (processor, memory ...); the nodes of the 
tree are the switches. Messages start at the leaves and move up the tree as far as they 
need to®, then they go back down the tree. Figure 2.4 shows a portion of a fat tree 
network. Since more messages need to pass through the higher nodes of the tree, the 


Sach time you move up a tree you double the number of leaf nodes that you are above. Once a 
message is above its destination it need go no higher. 
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bandwidth of each link often increases as you go up the tree. Often there are separate 
links for going up and down the tree, in this case you can think of the up and down 
portions of the network as being separate: A message first moves up the tree via the 
“up” network. Once it has gone far enough it switches to the “down” network and 
is switched down to the appropriate leaf. Figure 2.4 could represent either half of 
this network since they are symmetric. Not shown are the paths between the up and 


down networks.’ 


At each node in this network a one bit routing decision is made. On the trip up 
messages come in from two locations and the switch decides if the message should 
continue going up, or if it should be sent to the down tree. On the trip down, the 
switch decides which of the two sub-trees the message should be sent to. Since each 
stage routes based on one bit, we can easily make this network out of PaRCs. This is 
shown in Figure 2.5. In order to increase the bandwidth we will increase the number 
of PaRCs in a switching node. To double the bandwidth in a link we use double the 
number of PaRC chips at the switching node at the top of that link. (We could also 


increase by other amounts as well, but that is slightly more complex.) 


When we double the bandwidth, each PaRC in a switching node will have one 
input from each subtree, two outputs to the next higher level and two to the down 
tree. (Switch the words inputs and outputs if this is a down tree instead of an up 
tree.) Packets may cross either of the two connections to get to the next level. On 
the up tree, PaRC’s Up/Down routing mechanism would be used to send packets to 
the output with the shorter queue. 


When we do not increase the bandwidth, each PaRC in a stage has one connection 
coming from each sub-tree. In addition, it has one output going to the parent and 
one going to the down tree (assuming we are looking at the up tree). Since only two 
inputs and outputs are used, only half a PaRC chip is needed. Since one PaRC chip 
can act as two parallel 2x2 routers, we could combine two switches. In Figure 2.5 this 


“The top of the tree is not shown and would be handled differently. No switching node would be 
needed as any message coming up one side of the tree would always want to go down the opposite 
side of the tree. 
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A way to implement fat-trees using PaRCs. 
Several PaRCs may be used to implement one switching node. 


Link levels marked by "x2" have twice the bandwith between 
switching nodes as does the link level immediately below it. 


Figure 2.5: A fat tree made of PaRCs 


is shown by dotted lines around two switches that could be combined into one PaRC 
chip. When we combine switches we may be able to make use of PaRC’s Up/Down 
routing mechanism. This is because the outputs of the two switches are often going 


to the same places. 


2.9 Control Port 


PaRC needs a way by which its operation can be controlled. Rather than having 
this done by special control packets, PaRC has a dedicated control port. This port 
does not receive or send packets; instead it operates via a simple read-write protocol. 
This port is used to control various aspects of the operation of PaRC. It also allows 


information about the performance of PaRC to be read out. 
The parameters of PaRC which can be controlled by this port include: 


e How routing is done. 


e Whether the CRC should be checked or generated, and how many words long 
the CRC is. 


e What errors should be checked for. 
e What is done when errors are detected. 
e Whether or not circuit switched packets are allowed. 


e Which ports are operating and which are turned off. 
The performance values that can be read out include: 


e What errors have occurred 
e What input buffers are in use 


e How long each s-fifo queue is 


In addition this port keeps statistics on how the output ports are being used. These 


statistics will tell what percentage of the time each output port spends: 


e Transmitting packets 
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e Idling with no packets to send 
e Blocked due to WAIT being high 


e Blocked due to Circuit Switched Packets 


By looking at these statistics we can get a good idea of how effectively each output 


port is being utilized. 
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Chapter 3 


Implementation of PaRC 


3.1 The Technology 


PaRC is designed in LSI Logic’s LCA10000 CMOS compacted gate array series. The 
die used contains almost 75,000 gates. (A gate is defined as four transistors, the 
equivalent of one NAND gate.) Since routing is done over gates, typically around 
40% of the gates on a chip are usable; the exact figure varies greatly depending on 
the design. PaRC uses over 33,000 gates. 


On chip propagation delays for this technology are fairly fast. A two input NAND 
gate (ND2) in this series has a nominal low-to-high propagation delay of 0.6 ns. 
However, this is not as fast as it seems. This delay is increased due to the wire 
connecting this output to other inputs, and due to the loading of those other inputs. 
When the ND2 gate is driving 4 gates, its nominal delay may increase to about 1.5 ns. 
The above values are the nominal times. The actual delays can vary greatly based on 
process factors when the chip was fabricated, and the environmental conditions (¢.e., 
temperature and voltage) the chip is used in. The best case delays are about 1/2 the 
nominal delays, and the worst case delay are nearly twice the nominal. So the worst 


case delay for the ND2 mentioned above would be about 2.8 ns. 


Of course other gates have different delays. The delay of a 2 input NOR (NR2) 
gate under the same conditions may have a worst case delay of 5 ns, fully one quarter 


of PaRC’s cycle time! There are also high drive gates which are more effective at 
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driving large loads. If the above NR2 gate were a high drive gate (NR2P), its worse 
case delay would improve to about 3 ns. But there is a catch: high drive cells are 


larger, and often have higher input loadings than the normal drive gates. 


Since the delays depend so much on the wiring, exact delays are not known while 
the design is being done. Once the schematics are complete, a top level floorplan of 
the chip is done. This floorplan tells where on the die the top level blocks of the 
design should be placed. LSI Logic uses this as a guide when doing the complete 
layout of the chip. After the layout is done the exact wirelengths are known. These 


wirelengths are used to provide a much better estimate of gate delays. 


The rest of this chapter will describe the implementation of the components which 


make up PaRC. 


3.2 Packet Buffers 


Each input port has its own FMEM block which is made up of four packet buffers, 
along with some multiplexing logic. The packet buffers each store exactly one packet. 
They can only be written by that one input port, but can be read by any of the 
output ports. The multiplexing logic is used to route the data to the appropriate 
output port. The four FMEM blocks account for about half of the gates used on the 


chip. A top level diagram of a single packet buffer is shown in Figure 3.1. 


A major constraint that affected the design of the packet buffers was that all events 
had to be created due to the rising edge of the clock. The incoming clock may have 
been generated in the previous stage’s PaRC chip, so it may be very unbalanced by 
the time it arrives on this PaRC chip. Since there is a very large variance in where 
the falling edge could be with respect to the rising edge, we could not use the falling 
edge of the clock to help generate any signals. Also since the speed of circuits varies 
greatly with process and temperature, we could not reliably use a delay element to 
give us a pulse that lasted only a portion of a cycle. This meant that all pulses had 


to last for one or more full cycles. 
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These signals come 
from the Input Port. 
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Figure 3.1: A Packet Buffer 


In particular, this meant that the write pulse for the memory within the packet 
buffers had to be an entire cycle long. Because we needed to write the ram during 
consecutive cycles, and because the address and data would have to change based on 


the same edge as the write pulse, the memory had to be divided into several banks. 


For these banks there was the choice of using pre-defined memory blocks, or de- 
signing the memory ourselves out of single bit memory cells. The prebuilt memory 
cells are specially laid out to optimize the use of chip area. So they are denser than a 
similar memory built out of discrete cells. Of course, since these are predefined they 
must be used as is. They do not allow the same flexibility as a specially designed 


memory would. 


Using the prebuilt cells would have required a three cycle write phase with the 
write pulse occurring during the second cycle. In order to guarantee the setup and 


hold times for the address, the address would have to appear during the cycle before 


AT 


the write, and continue into the cycle after the write. Since signals can only change 
at the start of the cycle, the address would need to appear at the start of the first, 
and the write pulse could appear at the start of the second. The write pulse would 
not end until after the start of the third, so the address would have to remain valid 


throughout the third cycle. 


By designing the memory ourselves, we can get more flexibility. Each packet buffer 
in PaRC is broken into two blocks, each with six 16-bit locations. Instead of a block 
having an address and a single write line, each location in a block has its own separate 
write line, and each block has a select line. A word is written whenever the word’s 
write line and its block select line are both active. Since we do not have an address 
that must be stable both before and after the write pulse, this arrangement allows a 
bank to be written every other cycle. (Two cycles are needed because the data must 


be stable both before and after the end of the write pulse.) 


Each packet buffer contains the logic necessary for reading packets out of the buffer. 
This allows a packet to be given to an output port as soon as a port needs it. Each 
word in a block has its own separate read line; a word will drive the block’s output 
bus whenever its read line is asserted. The control logic guarantees that exactly one 
of each block’s read lines is asserted at all times. The output of the two memory 


blocks are multiplexed together to provide the data output of the packet buffer. 


The packet buffer has a START signal which tells it to start reading out a packet. 
The buffer waits until it receives the START signal, sends out the 12 words it has 
stored, and then waits until it receives another START signal. When not in the 
process of reading out a packet, a buffer has its two memory blocks reading out their 
first word. The output from the first of these blocks is selected by the mux. This 
allows an output port to read the first word of the packet during the same cycle in 
which it asserts START. On the next cycle the packet buffer flips the select input 
to the mux so that the second word is sent out. At the same time the read address 
for the first block is incremented, so that the third word can be sent out on the next 


cycle. In this way the value to be sent next will always be available before we need 
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to send it. 


When a packet buffer reads out a packet, it must also tell the input port that the 
buffer has been freed. Since the input port may be waiting for another buffer, the 
buffer should be released as early as possible. This can be done before the packet 
is completely read out because it will take several cycles before the input port can 
synchronize the buffer-available signal and start to write a new packet into this buffer. 
It is even safe to have another packet being written into the buffer while the current 
packet is being read out, but only if we can guarantee that no output port will try 
to read the new packet until after the first packet has been completely sent out. The 
signal to release the buffer is given when the eighth word of the packet is being read; 


the reasons why this is safe will be described in the next section. 


The FMEM block, shown in Figure 3.2, contains the four packet buffers associated 
with an input port. The multiplexers in the FMEM block are used to help each output 
port get the data from the appropriate packet buffer. Each output port can read from 
any of the sixteen packet buffers on the chip. This means we need to multiplex the 
data from the 16 buffers so that an output port can read the data from the correct 
buffer. The FMEM block performs the first stage of this multiplexing. The data from 
each packet buffer goes into four 4-to-1 multiplexers. Each output port receives data 
from one of these multiplexers. The output port has direct control of the select lines 
for its multiplexer, so it can select the data from any of the four buffers. Putting the 
multiplexers in the FMEM block cuts down greatly on the amount of area needed for 
placing wires between the packet buffers and the output ports. Instead of having a 
64 (= 16 x 4) bit bus going to all four output ports, there are four 16 bit buses each 


going to just one output port. 


3.3. Input Port 


The Input Port is broken up into a number of sub-blocks. The PACSTART block looks 
at the incoming data to determine when a packet starts. The PACCNT block counts 


AQ 


WRITEs 


DATA 


2x4 


Select 
Ram 


START 


— 


16x2 


ce ee ee 
; i 


Packet 
Buffer 


WRITES 

DATA 

SEL 
Make-Avai 


START 
DOUT 


Packet 
Buffer 


WRITES 

DATA 

SEL 
Make-Avai 


START 
DOUT 


Packet 
Buffer 


WRITES 

DATA 

SEL 
Make-Avai 


START 
DOUT 


Packet 

Buffer 
WRITES 
DATA 


SEL 
Make-Avai 


START 
DOUT 


SEL 


[Signals from output ports] 


[Signals to/from inport ports] 


MAKE-AVAIL 


= 


16 


Mux 4:1 


SEL 


Mux 4:1 


SEL 


Mux 4:1 


SEL 


Mux 4:1 


SEL 


Figure 3.2: The FMEM Component 
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the incoming words of the packet and produces the write signals for the packet buffers, 
while the RDATA block supplies the data for the packet buffers. The AVAILB block 
keeps track of which buffers are available, and the WAITUNIT uses this information 
to decide when to assert Wait. The SELPORT block determines which port the 
packet should be sent to, and the MAKEREQ block makes a request to that port. 
Lastly, the CRC and IDLE blocks check for CRC and IDLE errors. Figure 3.3 shows 
a block level diagram of the input port. 
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Figure 3.3: Block Diagram of the Input Port 


The stage transmitting to an input port sends both data and a clock to the input 
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port. It is this external clock (XCLK) that most of the input port runs off of. The 
data coming into an input port must be aligned with XCLK such that the clock rises 
while the data is stable. To minimize the setup and hold times of the incoming data, 
this data is accessed only in a limited way. When the data is brought onto the chip 
it goes directly into registers, and only via the outputs of these registers will the 
input data be sampled. To allow the register outputs to have some dependence on 
the previous state, scan registers will be used. These are simply registers with a two 
way mux on their data input. The scan registers allows the register to be loaded 
with either the incoming value or an internally generated value. This gives enough 


flexibility while adding only slightly to the data setup time. 


The PACSTART module uses the incoming data to create the signal FIRST. This 
signal is high during the cycle when the first word of a packet is available. This signal 
is used by other blocks which need to know when a new packet is beginning. The 
signal is created by loading the incoming SOP bit into a scan register whenever a 
new packet could possibly begin, and loading a 0 at all other times. This module 
also latches the incoming data into 16 registers to produce the LDIN (Latched Data 
IN) bus. This bus contains the word just received and is used by the CRC and IDLE 
blocks to check for errors. When PaRC is generating the CRC, the generated CRC 
will be latched instead of the last words of the packet. 


The PACCNT block counts the incoming words of the packet using a 12 bit unary 
counter. The output of this counter is used as the write signals for the packet buffers. 
The AVAILB block will produce select signals so that a packet buffer is only selected 


when a packet is being received. 


The AVAILB block produces the AVAIL signal, which tells which of the buffers are 
available. When a packet starts being written into a buffer, that buffer is immediately 
marked as full. When the buffer signals that the packet has been sent, the buffer is 
marked as available. Several cycles before a new packet could be received, this block 
tries to choose one of the available buffers into which the next packet will be written. 


If none are available, then it keeps trying to choose a buffer until it is successful. If 
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a packet starts while no buffers are available, a room error will be signaled. When 
a buffer is chosen, it appears in the PLOC (Packet LOCation) signal. This signal is 
sent to the schedulers so that when a request is made, the schedulers will know which 
buffer the packet is being stored in. Since we choose the next buffer shortly before 
the end of a packet (about 2 words before the end), PLOC will change before the 
packet ends. This is safe to do since we know that the scheduler will have processed 


the request by that time. 


The reason we choose the next buffer earlier than is necessary is to allow packet 
buffers to be released sooner. As mentioned in the previous section, buffers can be 
released before their entire packet has been read out. Releasing a buffer early is 
beneficial because the input port may be asserting WAIT, waiting for a buffer to 
become available. However, we can not release a buffer too early, it must be released 
late enough so that a new packet stored in it will not start to be read out until the 


old packet is completely read out. 


If AVAILB always waited until the last possible moment to choose a buffer, a buffer 
might be chosen just as it was released, and a packet might immediately be written 
into that buffer. The packet buffers would have to release their buffers late enough 
so that this worst-case scenario would operate correctly. By choosing the next buffer 
two cycles before we need it, we can guarantee that a buffer is not used in the two 
cycles after it is released; this allows us to release buffers earlier. When there are 
buffers available, a buffer will be chosen two cycles before it is needed, so obviously 
it will not be used for at least two cycles. However, if no buffers are available when 
we try to select one, a buffer which later becomes available might immediately be 
selected. This will still work because WAIT will have been asserted, so no packets 
will be received (and therefore no buffer will be used) for at least the next two or 


three cycles. 


The AVAILB block also produces the select signals for the packet buffers. Since 
the write line is always active for one of the 12 words, this block must only have a 


buffer selected when we actually want to write that buffer. 
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Figure 3.4: Writing into a Packet Buffer 
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Within a packet buffer there are two blocks of memory. The select line for each 
of these is controlled separately. This prevents glitches when switching from writing 
one buffer to writing another. As soon as the first word of a packet is received, the 
select line for the upper block of the buffer is asserted. The write line for the first 
word will be asserted for this cycle, thus writing the first word. On the next cycle the 
write of word one ends, and the write of the second word begins. This is when the 
select line for the lower block becomes asserted. This allows the second word (which 


is in the lower block) to be written. 


The select lines stay asserted for most of the packet and writes alternate between 
the two blocks. The last word will be written into the lower block. When the write 
of this word begins, the upper select line is deasserted. This is done then because on 
the next cycle the write line for the first word will be asserted again, and we need 
to guarantee that the upper block is deselected before that write begins. On the 
following cycle, during which a new packet may begin, the lower block is deselected. 
As with the upper block, this is done a cycle before the next write to that block may 


occur. Figure 3.4 shows the signals used in writing to a packet buffer. 


The RDATA block produces the RDATA bus, which is the data that is written 
into the packet buffers. Since the data going to the two blocks of memory in a packet 
buffer must change at different times, two buses of data are produced. One goes to 
the upper block of memory in every packet buffer, the other goes to the lower. When 
storing a packet these buses change on alternating cycles. This holds the data stable 
for about a cycle both before and after the end of the write pulse to a location. When 
not storing a packet the first RDATA bus is loaded every cycle because a new packet 


could begin at any time. 


The value for the RDATA bus is produced by scan registers which need to be able 
to load data from three places. The new value for the RDATA bus usually comes from 
the incoming data bus. When the CRC is being generated, however, the last words of 
the packet are instead taken from the CRC block. Also, since the RDATA bus should 


only change every other cycle, when it does not need to change the registers should 
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be loaded with their current value. This is implemented by having the incoming data 
go directly to the data inputs of the scan registers. As mentioned earlier, this is done 
to minimize the data’s setup and hold times. The other input to each register is 
connected to a multiplexer which allows it to get either the register’s current value 


or the CRC value. 


The SELPORT block selects which port the packet should be sent to. It receives the 
first incoming data word from the RDATA block. The routing address is determined 
by selecting two bits from this word. The value of RINF (Routing-INFormation), 
which is a user programmable value in the control port, determines which two bits are 
used. The bits are selected by putting the data word into two 16-to-1 multiplexers; the 
upper multiplexer chooses the upper bit of the address, the lower multiplexer chooses 
the lower bit. Each multiplexer is separately controlled by RINF.' The output of the 
upper and lower multiplexers are used to produce the two bit PORT value, which 


indicates the port that this packet will be sent to. 


Since the SOP bit of the first word is always 1, it is replaced by bits describing 
the location of this input port. So in the input to the upper multiplexer, the SOP 
bit is replaced with the first bit of the input port’s location. Similarly in the lower 
multiplexer the bit is replaced by the second bit of the input port’s location. This 


allows routing to be influenced by which port the packet arrived in. 


To support the Up/Down routing mechanism described earlier, this block is able to 
further modify the lower bit of PORT. This block receives a signal from the scheduler 
for port 3 which tells it which of ports 2 and 3 has a smaller queue. It receives a 
similar signal from port 1 which compares ports 0 and 1. These signals are first 
synchronized to the local clock by being put through two registers. In order to 
do Up/Down routing the lower bit of PORT is put through another multiplexer. If 
Up/Down routing is not being done this multiplexer simply passes the bit unchanged. 


'The value of RINF coming from the control port is used directly and is not synchronized to this 
ports clock domain. This implies that RINF must not be changed while packets are being received. 
Synchronizing RINF would have required a total of 64 registers (16 per input port). Since RINF 
will normally be set once during startup and then remain the same, synchronizing it was not worth 
the additional hardware. 
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If Up/Down routing is being done the bit chosen will depend on the value selected by 
the upper multiplexer: If the upper bit is a one, the Up/Down information coming 
from port 3 will be selected, otherwise the Up/Down information coming from port 
1 is used. This modified lower bit is combined with the upper bit to give the final 
value of PORT. 


The PORT signal is only valid during the first two cycles of a packet. This is be- 
cause the PORT signal is produced directly from the stored first data word. After the 
first two cycles, the registers in RDATA which hold the first word will be overwritten 
with the third word. Also, since the Up/Down signals coming from the schedulers 
may change at any time, the registers producing the synchronized versions of these 
signals are constrained so that they will not change on the cycle after a packet begins. 
This will allow the PORT signal to remain the same for the first two cycles even if 
the Up/Down signals change. 


When a packet begins, the MAKEREQ block will make a request to the output 
port indicated by the PORT signal. The block must first decode the selected port 
into a 4 bit unary value and store that value. It is stored using a latch which is gated 
by the FIRST signal. A latch is used instead of a register so that the data appears 
as soon as possible. Since the PORT value is held for the first two cycles, it can be 
safely latched in by the falling edge of FIRST which will fall at the beginning of the 
second cycle of the packet. When the 7th make-request latches goes high, it means a 
request will be made to the zth port. 


Since requests must be synchronized to the internal clock (ICLK), the rest of this 
block operates on that clock. As soon as a new packet is detected, a latch is set to 
1 signifying that a request needs to be made. The output of this latch needs to be 
synchronized to ICLK. Normally a signal is synchronized by putting it into a register, 
and putting the output of that register into another register. If the first register 
sampled its input as the input was changing, it may take extra time for the register’s 
output to stabilize. But with very high probability it will stabilize in time to be safely 


read by the second register. 
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Synchronizing the request signal is done slightly differently. When a packet begins 
a | is put into the first register of a synchronizer. Instead of having one register as 
the second stage of the synchronizer, four scan registers are used with the output 
of the first register being used as the select inputs of the second stage. These four 
registers are the request registers. A data input for the zth request register is the 
ith bit of the unary port value. When the 1 appears on the output of the first stage 
of the synchronizer, on the next clock edge the request registers will be loaded with 
the unary port value. So one of these registers will go high and the other three will 
stay low. The output of these registers are buffered and sent to the schedulers as the 
request signals. When one of the request signals is asserted, it stays asserted until a 
scheduler acknowledges the request. The second data input to each request register 


is used to accomplish this. 


Note that the request registers are acting as the second stage of the synchronizer 
(synchronizing to ICLK), yet one of their data inputs (the port value) is not synchro- 
nized with ICLK. Most of the time the select signal will be low so the value of the 
unsynchronized data input is unimportant. (The other data input is synchronized, 
and is used to keep requests active until the acknowledgment arrives.) The unsyn- 
chronized data input is only sampled when a new request is about to be made. This 
occurs shortly after a new packet begins. More precisely, the sampling occurs at least 
one full ICLK cycle plus some propagation delay after the rising edge on which a new 
packet begins. When a new packet begins the data input giving the port value is 
guaranteed to get the correct value by 15 ns into the cycle. (This is why a latch is 
used for this value rather than a register.) This means that the data is guaranteed 


to be stable in time to be safely sampled.” 


Error checking is done on packets by use of a Cyclic Redundancy Code (CRC) 
?For the data to be safely sampled the following must be true: 


Max_Time_Until_Data_Valid + DataSetup_Time < Min_-Time_Until_DataSampled 
or 15 + 2.5 < ICLK-Period + Propagation_Delay 
or ICLK_Period > 17.5 — Propagation_Delay 


Since the minimum ICLK period is 20ns, this will always be true. 
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[14]. The 32 bit CRC checksum appears in the final two words of the packet. The 


CRC block can either check or generate this value. 


A CRC code uses polynomial division to check the integrity of messages. It treats 
the &-bits in a transmitted message as the coefficients of a polynomial with & terms. 
It divides (modulo 2) this polynomial by a known generator polynomial. If there has 
not been a transmission error the remainder of this division will be 0. If there has 
been a transmission error then, with high probability, the remainder will not be 0. 
The CRC can be generated by doing a similar division and appending the remainder 


(sometimes called the checksum) to the original message. 


How effective the CRC is at detecting errors depends on the code used. PaRC uses 


the CRC32 polynomial, «*? u 


re ty7ttety?+1, as its generator polynomial. With 


CRC codes, it is possible to prove that certain classes of errors will always be detected. 
When checking a packet for errors, this code will always detect an error when there 
are only 1, 2, or 3 erroneous bits in a packet, when there are an odd number of errors 
in a packet, or when all erroneous bits occur within a span of 32 bits. This last 
property means that if all errors in a packet occur within two consecutive PaRC data 
words, errors will always be correctly detected. For errors which span more than 32 
bits, the probability of an error going undetected is less than 1 in 10°. Additionally, 
when errors occur on only one of a port’s 16 data input lines, this code will always 


detect the errors. 


Since the CRC involves doing modulo-2 division on polynomials, it is very simple 
to compute using only exclusive-or (XOR) gates and registers. To compute the CRC 
32 registers are needed. Initially these registers are set to 0. As the message is being 
received these registers store the partial result of the division. When the division is 


complete they will contain the remainder of the division. 


The simplest CRC circuitry operates on one bit of the message at a time. It 
consists of a 32 bit shift register where most of the register inputs come directly from 
the output of the previous register. Several of the register inputs are computed by 


XORing the previous register’s output with the output of the most significant register. 
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The locations with XORs correspond to terms in the generator polynomial with non- 
zero coefficients. This implements one step of the division wherein the generator 
polynomial is subtracted only if the current MSB is one. This can be expanded to 
allow n bits to received at a time, and the results of n potential subtractions to 
be computed at once. This results in each register input being the XOR of several 
values. PaRC receives 16 bits of data at a time, so the CRC generater was designed 


to consume 16 bits per cycle. 


One way to check for errors is to see if, after all words are received, the remainder 
is 0. It is possible to check for this slightly sooner. During the last two words of the 
packet, PaRC compares the input data to the current upper 16 bits of the partial 
result. No differences will be found iff the remainder will be 0. Since we don’t need 
to look at the remainder, we can synchronously reset the registers to 0 so that the 


CRC can be checked for the next packet. 


To generate the CRC we could simply process the message as above, and the 
remainder will be the CRC. As in CRC checking, this is modified so that only the 
upper 16 registers need to be examined. After we have received the first 10 words of 
the packet the upper 16 registers will contain the first part of the remainder, and the 
lower 16 the second. By replacing the current CRC data input (which is meaningless 
anyway) with the upper part of the remainder, on the next cycle the lower part of 


the remainder will be shifted into the upper 16 registers. 


This gives us the first (second) part of the CRC in these registers at the same 
time the 11th (12th) word of the incoming packet would normally be available. But 
we want to be able to use the first (second) part of the CRC in place of the 11th 
(12th) word. We would like other blocks to be able to simply latch in the CRC word 
instead of the incoming data word. To do this we need access to the CRC one cycle 
earlier. This is done by sampling the inputs to the upper 16 registers, instead of their 
outputs. This value is sent to other blocks which can latch it at the same time they 
would latch the 11th (or 12th) word of the packet.* 


3The paths where these values are sent to the RDATA block are the longest paths in the input 
ports. These paths are about 4ns faster than they need to be for PaRC to run at 50 MHz. 
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The IDLE block checks each incoming word to see if it is identical to one of the 
two idle patterns (hex 5555 and 2AAA). If a word is not one of the idle patterns and 


it is not part of a packet, then the idle-error signal is asserted. 


3.4 Scheduler 


Each output port has its own scheduler which tells it which packet should be trans- 
mitted next. The major components of the scheduler are shown in Figure 3.5. The 
scheduler receives request signals from the input ports and processes them by storing 
the requests on a queue. When requests are waiting in the queue, the scheduler in- 
forms the transmitter of this by asserting MORE. While MORE is asserted, PacInfo 
should contain the information about the packet that should be transmitted next. 
After the transmitter begins transmitting a packet, it will assert NEXT. This means 


it wants to find out about the next packet to be transmitted. 


An important goal in the implementation of the scheduler was to make it fast since 
time spent scheduling adds to the latency of non-blocked packets. The latency of the 
scheduler is one cycle: When the transmitter is waiting for a packet to send, and a 
request is made, on the next rising clock edge the transmitter will be able to latch in 


the information about that request. 


The scheduler receives a request signal from each of the input ports. These signals 
have already been synchronized by the input ports so that they will only change at 
the beginning of the cycle. Since several input ports may make a request at the same 
time, the scheduler must choose between them. For simplicity, the scheduler uses a 
fixed priority scheme: The input ports are given a fixed priority (3,2,0,1) and the 
request with the highest priority will be selected. This scheme is slightly inequitable 
since whenever two ports make concurrent requests, the same port will be chosen 
every time. A fixed priority scheme is allowable because we can guarantee that the 
lowest priority port cannot be “starved.” If its request is ignored in favor of one 


from a higher priority port, the request is guaranteed to be chosen before the higher 
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Figure 3.5: Block Diagram of the Scheduler 


priority port can make another request. 


A fixed priority scheme was used because of its speed. With this scheme, the 
requests need only go through two levels of logic (an inverter and a nand gate) to 
select the chosen request. A fairer scheme would have been slower and required either 
lengthening PaRC’s clock period or pipelining the scheduler (thereby increasing the 
latency of non-blocked packets). The slight benefits of fairer scheduling were not 
worth either of these costs. For simplicity, each scheduler uses the same priority 
list. (t.e., Every scheduler would choose port 3 over port 2 if both made requests 
at the same time.) Giving each scheduler a different priority list would have helped 
to remove some of the unfairness of a fixed priority scheme. Since the scheduling 
algorithm is otherwise first-come-first-served, this unfairness will not have much of 


an effect on PaRC’s performance. 


Once a request is chosen we need to select the rest of the request information from 
the input port that made the request. The chosen request (encoded in unary) is used 
as the select inputs to a mux to select the packet information bits from the correct 
input port. From these muxes the scheduler gets 2 bits which tell which of the input 
port’s buffers was used to store the packet, and one bit which tells whether or not 
the packet is a circuit switch packet. As this is done the selection lines are encoded 
into two bits which indicate which input port the packet is stored in. These 5 bits, 
which will be stored in the s-fifo, tell all that the transmitter needs to know about 
the packet. 


At the heart of the scheduler is the S-FIFO and its associated counters. The s-fifo 
is a 17 by 5 array of ram cells. One 5 bit request can be stored in each of the 17 
locations. Each of these locations has its own separate write line (actually write and 


nwrite lines) and read line. 


The scheduler maintains two counters as pointers into the s-fifo. The input counter 
isa 17 bit unary counter which identifies the next location to be written. The outputs 
of this counter are directly used as the write lines for the ram. Since this counter 
always points to the next location to be written, every cycle will do a write of the 
next available fifo location. If no request is being made during a cycle then the value 
written will be meaningless. When a request is made a correct entry will be written 
into the ram and the value of the input counter will be incremented on the next clock 


edge; thus preserving the entry. 


Since the write of an s-fifo location will not end until after the next cycle begins, 
there is a 5 bit latch at the data input of the s-fifo. This holds the s-fifo data input 
stable during the high phase of the clock, thus keeping the data stable until well after 
the write has ended. 


Although there can only be 16 valid entries in the s-fifo at any one time, the s-fifo 
has 17 locations. This is because we always do a write of the next location. By having 


this extra location we are able to greatly speed up, and simplify, the control of the 


s-fifo. 
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The second counter, the output counter, always points to the next request in the 
s-fifo. This counter value is used as the read address of the s-fifo, so that the s-fifo 
is always reading out the next request. This request information is passed on to the 
transmitter as PacInf. Since the transmitter needs this information early in its cycle, 
the transmitter will latch in this information at the beginning of any cycle during 
which it may want to start reading out a new packet. (This means that PacInf must 
be valid in time to be safely latched on a rising clock edge.) After the transmitter has 
used this information it will assert NEXT. This will cause the output counter to be 
incremented so that the next request will be read. (This is done several cycles before 


the new request is actually needed.) 


Since we begin to read each s-fifo location before its value is needed, the value can 
be read out in plenty of time to be presented to the transmitter. This is not the case, 
however, when the s-fifo is empty. When a request is made to an empty s-fifo, the 
request would need to be written into the ram, and then read out soon enough for 
the transmitter to latch it in at the beginning of the next clock cycle. The s-fifo is 
not fast enough to do this; instead we use a cut-through technique. When the s-fifo 
is empty, instead of using the value being read from the s-fifo for PacInf, we use the 
data that will be written into the s-fifo. This allows the value of PacInf to reach the 
transmitter in time. Even when this is done, we still write the request into the s-fifo. 


On the cycle after this request is made, we will go back to using the s-fifo output. 


The signal MORE is used to tell the transmitter when there are more packets to 
be sent. If it is high during a cycle then if PacInf was latched at the beginning of the 


cycle a valid request was latched. 


To compute MORE the scheduler needs to know how many packets it has in its 
s-fifo. A third counter, this one an up-down counter, is used to keep a running total of 
the number of packets in the s-fifo. If this counter is greater than 0, or if a request is 
being made, then MORE will be asserted on the next clock edge. Instead of this third 
counter we could have compared the first two counters to see if they were the same. 


The extra counter is used because the count is also needed for up-down routing. 
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In up-down routing we need to determine which of ports 0 and | (or 2 and 3) have 
the shortest queue. This is done by having the scheduler for port 0 (and 2) send its 
current queue length to the scheduler for port 1 (3). That port can then compare the 
lengths to see which is greater. Schedulers 1 and 3 then send this information to the 
four input ports. These schedulers also send an s-fifo count to the control section so 


that the s-fifo count can be made visible. 


3.5 Transmitter 


The transmitter is the component that is responsible for transmitting packets from 
the chip. The scheduler gives it information about the next packet to be transmitted. 
It must then set up a connection to the packet buffer containing the packet, tell that 
buffer to begin reading out its data, and send out the data. Between packets, idle 


patterns must be transmitted. Figure 3.6 shows an overview of the transmitter. 


3.5.1 Outgoing Data 


To ensure that the transmitted data is valid for the longest possible time, the 16 bits 
of outgoing data come directly from 16 registers. These registers are loaded with 
either the data coming from the packet buffers, or with an idle pattern. This section 
must also produce the clock that accompanies the transmitted data. This is done by 
taking ICLK (which also clocks the registers) and putting it through an appropriately 
sized inverter. This provides a clock which rises while the data is stable. We must 
guarantee that the clock rises at such a time that it will provide sufficient setup and 


hold times. 


There are a number of problems which limit our ability to maximize the setup 
and hold times. One is clock skew. By placing an output port’s data registers and 
clock inverter in the same area of the chip, the clock skew between them is kept 
fairly small.* A bigger problem is the variation of delays with process, voltage and 


4The skew of ICLK is as great as 2.3 ns between some components. But considering only the 16 
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Figure 3.6: Overview of the Transmitter 


temperature. The worst case delay of an element is almost four times greater than 
its best case delay after accounting for these three variations. The variations due to 
temperature and voltage are the least important of these. This is because they have 
less of an absolute effect on the delays than process variations, and because their 
changes affect the delays of all gates similarly. Process variations can account for 
delays varying by about a factor of three. This effect is made worse because all gates 
are not affected the same by process variations. Gates which pull up and those which 
pull down are created by different processes. So it is possible, for example, to have 
gates with worst case rise times on the same chip as gates with best case fall times. 
The worst case combinations must be considered when determining the true setup 
and hold times provided. 


single bit registers and the clock inverter in any one output port, the skew between any 2 components 
is only 0.4 ns. 
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3.5.2. Transmitter Logic 


The control section of the transmitter is made up of logic which counts how much 
of the current packet has been sent out, and determines when a new packet can be 
begun. A new packet can begin if 1) No packet is currently being sent out. 2) MORE 
is high, indicating that there are additional packets waiting to be sent out. 3) WAIT 
is low, indicating that the receiver can accept another packet. 4) This output port 
has not been turned off by the controller. 5) If the next packet is a circuit switched 


packet, the output port is not currently waiting for an acknowledgment. 


To shorten the latency of packets, the transmitter is able to begin sending a packet 
(i.e., load the output register with the first word of the packet) on the same cycle 
it determines that a new packet is allowed to begin. As mentioned previously, to do 
this the transmitter must be able to read the first word from a packet buffer in the 
same cycle that it finds out which packet buffer should be read from. This is done 
by latching in PacInf (the signals from the scheduler which tell where the packet is 
stored) at the beginning of any cycle during which a new packet might begin. Two 
bits of these latched values are sent to the select inputs of multiplexers in each of 
the four FMEM blocks.® This selects the data from one packet buffer in each FMEM 
block. Two more of the bits are used locally as inputs to 4-to-1 multiplexers which 
select between the data coming from the four FMEM blocks. In this way data from 
any of the 16 packet buffers can be selected. 


The packet buffer in which the new packet is stored must be told to start reading 
out the rest of its data. To do this the address from the stored PacInf is sent to 
PaRC’s control section, along with a signal which goes high only when a packet 
actually begins. The control section receives this start information from all four 
output ports and uses it to generate START signals for the 16 packet buffers. While 
a packet is being transmitted, the transmitter will assert NEXT for one cycle. This 
tells the scheduler to send information about the next packet to be read. 


>These signals are the beginning of the longest paths on PaRC so it was important to minimize 
their delays. The Q output of the registers are buffered and sent to the two most distant FMEM 
blocks. The QN outputs of the registers are buffered and sent to the two closest FMEM blocks. 
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The RELPROC block of the transmitter can generate or process acknowledgment 
signals for circuit switch packets. This block receives the output port’s incoming 
acknowledgment signal and produces four acknowledgment signals, one for each input 
port.® To be able to produce these acknowledgment signals, when a new circuit switch 
packet begins, this block latches in the part of PacInf which tells which input port 
the packet came from. When it becomes time to give the acknowledgment, this value 


is used to route the acknowledgment to the correct port. 


If the PaRC chip has been told to generate acknowledgment signals, then a 2 cycle 
acknowledgment pulse will be given shortly after a circuit switched packet begins 
to be sent. If it is not supposed to generate the acknowledgment, then it will wait 
for an incoming acknowledgment. As soon as this acknowledgment arrives (i.e., as 
soon as the incoming acknowledgment line is asserted) the outgoing acknowledgment 
pulse begins. The outgoing acknowledgment pulse will end 1 to 2 cycles later. Once 
this is done the control section of the transmitter is told that it is no longer waiting 
for an acknowledgment; this will allow circuit switched packets to be sent again. 
The start of the incoming acknowledgment pulse is used to both start and end the 
outgoing acknowledgment pulse. The unsynchronized version is used to start the 
outgoing acknowledgment pulse in order to minimize the delay of an acknowledgment 
passing through a PaRC chip. The synchronized version is used to end the outgoing 
acknowledgment pulse in order to guarantee that an acknowledgment pulse does not 
grow too long or too short. If the end of the incoming pulse was used to end the 
outgoing pulse then the pulse might get shorter (or longer) each time it passed through 
a PaRC chip. Using only the beginning of the pulse guarantees that the length of 
each pulse remains within a fixed range. 


5When an acknowledgment signal is asserted for an input port, that input port’s RELOUT output 
will be asserted. 
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3.6 Control Section 


The control section performs a number of functions which affect the operation of the 
entire chip. In particular, it implements the control port which allows a network 
control system to control and monitor the performance of PaRC. The control section 
also keeps track of any errors that occur and keeps statistics on how the output ports 


are being utilized. 
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Figure 3.7: The Control Port 


An overview of the control section is shown in Figure 3.7. The REGS block contains 
registers which store the values which control the function of PaRC. The WRITE- 
REGS section controls the writing of the values in REGS. The ERRMOD block keeps 
track of errors that have occurred, and takes any appropriate action. The STATS 
block keeps statistics on the use of PaRC’s output ports. These statistics are stored 
in a 16x24 ram. The READ-CP section handles reads of the control port. The values 


69 


read may reside in the control port or may come from elsewhere on the chip. 


To the user, the control port looks like 64 locations which can be read or written. 
The first 8 of these are used for setting various values which control the operation of 
PaRC. The next 8 are used solely for reading out values about PaRC’s current state. 
The final 48 are used for reading (and writing) the statistic values. See the User’s 


Guide for details on the use of each specific location. 


The control port interface is made up of 6 address lines (CAD), an 8 bit bidirec- 
tional data bus (CDATA), an output enable for the data bus (C_OE), and a write 
line (CR_W). A location is read when an address is given on CAD, and the output 
data bus is enabled. A write is done whenever the write line is brought low. The 
address lines should be kept stable during this time. If a specific value needs to be 
written, the output data bus should be disabled, and the value driven onto the bus. 


The WRITE-REGS section controls the writing of the values in REGS. A write 
should be done whenever the write line is asserted and the value on the address bus 
is less than eight. The incoming write line is inverted and used as a clock to store the 
values needed for the write. Whenever the write line is asserted this will store the 
data bus, the lower three bits of the address bus, and a bit which tells if the upper 
three bits of the address bus were all 0. If the upper three bits are all 0 then the 
address is less than eight and a write to the REGS section will be done. The write is 
done by producing a one cycle pulse on one of the eight write lines going to the REGS 
section. The above data and address values are latched in so that even if the input 
lines change after the write begins, a valid write will still be done. This prevents a 


bad write from corrupting many values in the REGS block. 


The REGS section contains the registers that store the control values. Each of 
the eight locations receives its own write line. Five of the eight locations store read- 
able/writable values. Some locations use latches with their enable connected to the 
write line. Others use registers which are loaded on clock edges in which their write 
line is asserted. The registers take up more space than latches, but they are used 


because they produce outputs that only change at the beginning of a cycle. The 
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other three locations are write only locations. When a write is done to one of these 
locations, some command is executed. One of these locations simply produces a pulse 
to clear the error counters. The other two are used to produce pulses about six cycles 
long. One is used as a reset command, the other is sent to the data link chips to tell 


them to synchronize to the PaRC chip. 


The ERRMOD module keeps track of the errors that have occurred. There are 12 
two bit counters used to count the number of CRC, IDLE, and ROOM errors that 
occur in each input port. These counters start at 0 and are incremented by one each 
time an input port signals a particular error. Once they reach their maximum, further 
increments will have no effect. The counters count using a gray code, which means 
that only one bit of the output will change at a time. This is done so that if one of 
the counters is read while it is being incremented, a valid value will still be obtained. 
The link errors do not use counters, but have just a single bit that says whether or 
not a given link error has occurred on a given output port. The error values can be 


reset by doing a write to the reset-error-counters command location. 


This module also produces PaRC’s ERROR output. This output goes high when 
any of the error values are non-zero. The gray encoding used by the counters prevents 
glitches from appearing on this signal when a counter increments. This module also 
produces the PASSIVE signal. When this signal is asserted the input and output 
ports will go passive. (i.e., Output ports stop sending; input ports assert WAIT.) As 
mentioned earlier, PaRC can go into passive mode when an error is detected. Settings 
in the REGS section tell whether all errors, all errors except link gray errors, or no 
errors at all will cause PaRC to enter passive mode. PASSIVE is computed from the 


error values, synchronized to ICLK, and then sent to the input and output ports. 


The READ-CP section handles reads of the control port. Mostly this involves 
multiplexing between the data stored in other sections of the control port. Values 
can be read from the REGS, ERRMOD, or STATS section of the control port. Some 
of the values that can be read come from the input and output ports. For these a 


signal is sent to the ports to select what data they send to the control port. The 
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selected data is sent out whenever the C_OE signal is low. 


There are 16 statistic values, each 24 bits long. When a read is done of a statistic 
location, the lower 4 bits of the address are used to determine which value is read; the 
upper 2 bits are used to determine which 8 bits of the 24 bit value are returned. Since 
the STATS module controls the ram while statistics are being collected, statistics 
cannot be read during this time. If they are then the value currently being output 


from the ram will be returned, regardless of which location was requested. 


3.6.1 Statistics 


The STATS section, shown in Figure 3.8, is used to keep statistics on how each 
output port is being used. It keeps track of how much time the outputs spend 
in the following states: transmitting (XMIT), idling (IDLE), waiting for a circuit 
switch acknowledgment (CSWAIT), and waiting for the WAIT signal to be deasserted 
(WAIT). These sixteen values (four per output port) are stored in a 16-by-24 ram. 
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Figure 3.8: The Statistics Section 


The STATS module can be in one of two modes: AUTO or Read-And-Clear mode. 


i 


When in AUTO mode the module will collect statistics by repeatedly reading and 
updating ram locations based upon the current state of the output ports. When in 


Read-And-Clear mode, the ram is under the control of the control port. 


When in AUTO mode, the stats module determines the state of an output port, 
increments the value corresponding to that state, and then moves on to the next 
output port. This update process takes three cycles, so each output port is sampled 
every 12th cycle. (Since packets are 12 cycles long, each output port’s XMIT statistic 
will give a count of the number of packets that port has transmitted.) During the 
first of the three update cycles the new ram address is latched in. The first two bits 
of this ram address correspond to the next port to be updated, the second two bits 
correspond to the state of that port. The next port to be updated is determined 
directly from the previous port. The state of the next port is determined in advance 
and stored in a register. This ensures that the next ram address is ready to be latched 


in during the first cycle of the update. 


The new address is sent to the ram and the read of the ram begins. At the end 
of the second cycle, the 24 bit value returned from the ram is stored in a register. 
During the third cycle the stored ram value goes from the register, through a 24 bit 
incrementer and to the data input of the ram. Also during the third cycle the write 
pulse is given to the ram. At the beginning of the next cycle (which is the first cycle 
of a new update) the write pulse will end. The ending of the write pulse is used to 
trigger the load of the new ram address. This ensures that the ram address stays 


valid until the write has completed. 


All of the signals sent to the ram first go through 2-to-1 multiplexers. When not 
in AUTO mode, these multiplexers allow the ram’s input to be controlled through 
the control port. The ram’s address will be taken from the lower bits of the CAD 
bus. The ram’s write pulse is generated from the CR_W, the control port’s write 
signal. When a write pulse appears on CR_W, it will be passed on to the ram if the 
address on CAD is greater than 15. (An address of 16 or more means that a statistic 


location is being addressed.) When not in AUTO mode the data input to the ram 
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will normally be given the value 0. This allows ram locations to be initialized when 
they are written to. For testing reasons, when not in AUTO mode, the ram data can 


also be taken from either the CDATA bus or from the incrementer. 


The 24 bit incrementer is built out of six submodules which are 4-bit incrementers. 
Each 4-bit incrementer takes in 4 bits of data and a carry-in signal. They each produce 
a “group carry” signal and 4 bits of the result. The group carry output is asserted 
iff all four of the data inputs to that submodule are ones. The carry-in to each 
submodule is asserted if all the submodules before this one have their group-carry 
asserted. (Additionally all carry-inputs can be forced high for testing purposes.) 
When carry-in is asserted a submodule will increment its data input, otherwise it 
will output it unchanged. The longest path through this incrementer is only 4 gates. 
The value going to the incrementer comes from a register which is loaded on ICLK. 
The time from the rising edge of ICLK until the incrementer’s output is stable is less 
than 12.5 ns (assuming worst case process, temperature and voltage). This needs to 
be fast in order to meet the data setup time for the ram. The write pulse going to 
the ram begins based on the same rising clock edge which loads the incrementer’s 


register. This write pulse will end one cycle later. 


When collecting statistics, the statistic values must be sampled and reset regularly, 
or they may overflow. If the incrementer overflows it simply wraps around to 0. Since 
values are 24 bits long and can be incremented only on every 12th cycle, overflow 
could occur after 274 * 12 cycles. At 50MHz, this is approximately every 4 seconds. 
To make it easier to read the statistics before they overflow, PaRC can produce a 
Statistics-Half-Full (SHF) signal. When statistics are being collected, this signal will 
be asserted when a stats location is written with a value that is more than half the 
maximum value. When the network control system sees this signal asserted it can 
stop statistic collection and read out the statistic values. It can then reset the values 


and restart statistic collection. 
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3.7 Other Details 


This section describes several other important characteristics of the implementation 


of PaRC. 


3.7.1 Single Packet Latency 


Minimizing the single packet latency was an important goal of this design. The 
latency of this design is approximately 100ns. To measure this time we will calculate 
the delay from when the incoming clock rises (with the first word of the packet on the 
input data bus) until the first word of the packet is on the output data bus and the 
outgoing clock rises. The time from XCLK going high until the new-packet signal is 
set up at the synchronizer input is under 14 ns. When this signal is detected by the 
first stage of the synchronizer depends on how the incoming XCLK is aligned with 
ICLK. This may take up to one cycle. One cycle after this the request signal will be 
generated, and one cycle later the MORE signal will go high, telling the output port 
that a packet is waiting to be sent. One cycle later the output registers will be loaded, 
and the first word of data will be placed on the output bus. Half way into the next 
cycle (when ICLK falls) the rising output clock will be generated, and will appear 
outside the chip within 5 ns. Summing up these times gives us 34 ICLK cycles, plus 
19 ns, plus 0 to 1 cycles at the synchronizer input. This is less than 4 > to 5 > cycles, 


which is equivalent to 100 + 10 ns when PaRC is running at 50MHz. 


3.7.2. Critical Paths 


The design goal was for PaRC to be able to run at 50 MHz. This meant that all 
register to register paths would have to be under 20 ns. When determining if a path 
meets the 20ns limit, there are two factors that must be taken into account. The first 
is the actual delay of the logic, the second is time lost due to clock skew. Since a 
single clock driver is driving the clock signal to all gates on the chip, the clock will 


arrive at different locations at different times. If the clock going into the register at 
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the end of a path arrives before the clock going into the register at the beginning, 
then the difference in the clock arrival times (i.e., clock skew) must be subtracted 
from the 20ns limit. Approximate clock skews are not known until after layout, but 
the skews are usually under 2 ns. When the chip was being designed all paths were 
kept under 18ns. This was to allow margin for clock skew, or for any delays that 


might have been longer then originally predicted. 


The longest paths in PaRC are the paths used to set up the multiplexers in the 
FMEMs. When a new packet is begun in an output port, a packet’s address is latched 
into registers at the beginning of a cycle. The outputs of two of these registers are sent 
to all four FMEMs to control the multiplexers which send data to the output port. In 
each FMEM these signals are buffered and used as the select input to the multiplexers. 
Once the select inputs are available, these multiplexers select the appropriate data 
and send it to the output port. The output port then uses additional multiplexers 
to choose between the data coming from the four ports. The chosen data then goes 
into a register. The length of these paths was kept under 18 ns. This was done 
by using some extra hardware to speed up the first part of the paths (7.e., the part 
that produces the mux select signals). Speeding up the first part of a path was more 
economical since it consists of only a few signals, whereas the second part consists of 


a bus of 16 signals. 


Although these paths go through several blocks, each path ends in the same block 
it began in. This means that the beginning and ending of a path will be in the same 
area of the chip. For this reason we expected the clock skew on these paths to be 
small. This turned out to be the case. After layout was done, all critical paths on 


the chip were under 174 ns, even after accounting for clock skew. 


3.7.3. Clock Frequency Differences 


This design does put some constraints on how closely the clock frequencies of two 
adjoining stages must be matched. One such constraint occurs in the reading and 


writing of packet buffers. A packet can be read out of a packet buffer while it is still 
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being written in. ICLK, which is used to read out the packet, must not be too much 
faster than the XCLK which is used to write the packet. If it were, we could try to 


read part of a packet before it was written. 


From the time the first word is received, it takes just under 12 XCLK cycles for 
the last word to be written into the packet buffer. From the time the first word is 
received, it will be over 2 ICLK cycles before a transmitter begins to read that packet. 
Reading out the last word will not begin for another 11 cycles. So to guarantee that 
the last word is written before it is read, the time for 13 ICLK cycles must be more 
than the time for 12 XCLK cycles. In other words, the XCLK period must be at most 
13/12 = 1.08 as long as the ICLK period.” There are a number of other interactions 
between ICLK and XCLK that put constraints on their relative timings, but the 


constraint described above is the most restrictive. 


’This analysis uses conservative timing numbers. Doing a more precise analysis shows that XCLK 
can be as much as 1.13 times as long as ICLK. 
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Chapter 4 


Test Vectors 


4.1 The Need for Test Vectors 


After chips are fabricated it is necessary to test them to determine if they were 
fabricated correctly. A set of fabrication tests is needed to check if the logic on the 
chip functions as the circuit design says it should. This is not the same as testing to 
see if the design is correct. That is a very different question. Testing to see if the 
design is correct would involve comparing the high level specifications of the chip to 
how the design actually functions. But for these fabrication tests the specifications of 
the design are irrelevant. This test must compare the actual silicon to the silicon that 


was specified by the design (regardless of whether or not that design is “correct” ). 


This test is done by producing a series of vectors which tell what inputs should 
be applied to the chip, and what results the chip should produce. Automated testing 
equipment will cycle through these vectors in order. On each cycle it will read the 
test vector, apply the appropriate inputs, and check to see if the chip outputs match 
the predicted outputs. If the actual outputs do not match the expected outputs then 
the vector is said to fail. The job of producing these vectors is known as test vector 


generation. 


There are two goals of test vector generation. The first is to make sure that vectors 
will fail only if the chip is defective. This goal is non trivial because there is more 


variance in the actual operation of the chip, than in its simulation. Consider the 
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simple case where a data input is changed shortly before a clock input is changed. In 


the simulation this may work fine because the data always arrives before the clock. 


However, when the test is actually run there will be some uncertainty (about +4 ns) 
in when the tester actually sends any given signal to the chip. If the data is skewed 
late and the clock is skewed early, the data may not arrive in time and the vector will 
fail even though the chip is not defective. There are also large uncertainties in the 
delays inside the chip due to process variation. For both these cases, we must ensure 
that vectors will produce the same results whenever the delays fall anywhere within 
the allowable range. A number of methods are available to help ensure that this goal 
is met. For example, we can run simulations where some inputs are changed earlier 
or later than they should be. This imitates the skew that could be introduced by the 
tester, and could point out problems that tester skew could create. By using these 
and other tests on the test vectors, we can be fairly confident that a test vector will 


fail only if the chip is defective. 


The second goal is to construct the vectors such that if there is a defect anywhere 
on the chip, then some vector will fail. This is a much harder goal to meet. This is 
difficult both because of the large amount of logic on the chip, and because some of 
that logic is hard to control directly. We must also try to use as few test vectors as 
possible since we can only use a limited number of them. (LSI Logic, for example, 
recommends limiting the number of test vectors to 16,000. If significantly more are 
needed, they can be added for an additional charge.) It is for these reasons that the 


job of test vector generation can be so time consuming. 


4.2 Testing for Defects 


To check for defects in the chip we need to have a model of what defects may do to 
the operation of the chip. A useful model is the “stuck at” model. In this model 
a defect is assumed to cause a node on the chip to always be zero or to always be 
one. The goal of test vector generation is to produce vectors that will find any single 


stuck-at-zero or stuck-at-one fault. Of course other types of faults are possible; for 
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example two signals might be unintentionally connected. Also, it is possible to have 
more than one stuck-at fault on the same chip. Although we detect any single stuck- 
at fault, this does not guarantee that we will detect a problem if two faults occur. 
However if we can detect any single stuck-at fault, it is very probable that we will 


detect any type of fault or number of faults that might occur. 


To detect a given stuck at fault, we must present inputs such that if that fault 
exists the output values will differ from what they would be if the fault did not exist. 
Let’s first look at the simple example of testing a 2 input nand gate. We want to 
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A) Test 1 QO Stuck-at-1 
Stuck-at-0 1 
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0 
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D) Test 4 J] Stuck-at-0 
1 


Figure 4.1: Testing a NAND gate 


check both its inputs and its outputs for any stuck-at faults. For a 2 input NAND 
gate there are only four possible input sets. Figure 4.1 shows the four possible tests 
and what faults each test would detect. Part A of the figure shows the test where ones 
are applied to both inputs; the result should be a zero. If either input were stuck at 


zero, then the output would be a one, so the vector would fail. Similarly if the output 


os) 


0 


were stuck at one, then the result would be incorrect and the test would fail. Thus 
this test checks for stuck-at-zero faults on each input, and a stuck-at-one fault on the 
output. Part B of the figure shows zeros applied to both inputs with the expected 
result being a one. This test checks only for a stuck-at-zero in the output. Even if 
one of the inputs was stuck at one, the output of the gate would not be affected. Part 
C shows the test where a one is applied to the first input, a zero to the second, and 
a one is expected at the output. If the second input was stuck at one, both inputs 
would be one, so a zero would be produced and the test would fail. If the output 
was stuck at zero the test would also fail. The first input is not tested; even if it had 
an incorrect value (zero instead of one), the output would not have changed. The 
final test, shown in part D, is similar to the previous test except that the inputs are 


reversed. 


If we wanted to create a set of test vectors for this NAND gate, we would not have 
to include all four tests. We need only the minimal set of tests such that all faults 
are checked for. In this example tests A, C and D form such a set. Test B is not 
needed since it only detects output-stuck-at-1 faults, which two other tests also check 
for. Carefully choosing the tests can reduce the number needed; this is especially 
true with gates with many inputs. For example an n-input NAND gate can be tested 


using only n + 1 of the 2” possible inputs combinations. 


Figure 4.2: A NAND-OR circuit 


In the above example we assumed that the output of the NAND gate was directly 
visible, but this is not always the case. Consider the case shown in Figure 4.2 where 
the output of the NAND gate is put into an OR gate. Now there are further con- 
straints on our test of the NAND gate. If the top input to the OR gate is one, then 
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the output of this circuit will always be one. Any attempted tests of the NAND gate 
will be useless, since the NAND gate will have no effect on the output of the circuit. 
If the top input to the OR gate is zero, then the output of the circuit will be the 
same as the output of the NAND gate. So to test the NAND gate, we must do the 
three tests chosen above while the upper input to the OR gate is zero. This is easy 
to do if the top input to the OR gate is directly controllable, it takes more effort if it 
is produced by other logic. When a change in a gate’s output will cause a change in 
the output of the chip, we say that the output is observable. In general, whenever a 


gate is tested we must make certain that its output is observable. 


When circuits also include memory elements, the task of checking for faults be- 
comes even more difficult. When the input to a gate is influenced by the state of a 
register, then all tests of that gate are influenced by the previous test vectors. Before 
doing a test of that gate we will have to load the register with a specific value. This 
may take just one vector, or it may require many. If the output of a gate goes to a 
register, it becomes more difficult to ensure that when the gate is tested, its output 
is observable. If the output of the register goes directly off chip, then for the gate’s 
output to be observable, we must simply make sure that the register is loaded during 
the cycle in which the test is done. A fault in the gate would then show up as an 
incorrect output one cycle later. Usually, however, the register does not go directly 
off chip, but instead feeds other logic. So to test a gate, we must also control the rest 
of the chip both before and after the fault is tested for. The chip must be manipu- 
lated such that if the gate has a fault, a changed gate output value will work its way 
through the logic and registers, and eventually cause a changed chip output. Again, 
this changed chip output may happen shortly after the faulty gate is tested, or it may 


occur many cycles later. 


4.3. Test Vectors for PaRC 


There are a number of problem circuits that are encountered when generating test 


vectors. This section will illustrate some of these by choosing some components in 
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PaRC and showing how they were tested. 


For many of the gates in PaRC devising tests is quite easy. Consider testing the 
data inputs of a memory cell in a packet buffer. If we store a packet that has a one in 
that location, and then later read it out, that location’s data lines have been tested 
for stuck-at-zero faults.1 Later a packet can be stored that has a zero in that location. 


Reading it out will complete the test for stuck-at-one faults. 


Just by receiving and sending a few packets, much of the logic in the input and 
output ports is tested. This is because much of the logic is used every time a packet 
is received and sent. Circuitry that is used less often will take more time to test. For 
example, testing the ram cells in the scheduler’s memory will take longer than testing 
packet memory because a write to a scheduler’s memory is done only 1/12 as often 


as a write to a packet memory. 


Circuitry with many more inputs than outputs can be difficult to test, even if it is 
used often. Consider the comparators in the scheduler which are used for Up/Down 
routing. These comparators take as input the number of packets this scheduler has, 
and the number that its partner has. It produces only a single output which says 
which one is less than the other. Since there are so many inputs that give the same 
result, it is easy for a fault to have no effect on many input patterns. In fact there 
are gates that will have an affect on only one of the many possible input patterns. 
To test this component it was necessary to examine each gate in the comparator and 
make a list of what input values would detect the faults that could occur on that 
gate. This was then used to create a list of input patterns that would together detect 
all faults. By carefully choosing a subset of these patterns, only eleven tests were 
needed to cover all faults. The input to this comparator is the number of requests in 
each of two scheduler queues. So for each of the eleven tests, packets had to be sent 
to, or released from, the two output ports so that each port contained exactly the 


For this test a packet may be written into a buffer, held for a long time and then read out. If 
there is a fault on the data input line the wrong value would be written into the ram, and this wrong 
value would later be read out. This is a simple example of a fault not being observed until many 
cycles after it occurred. If the packet is never read out of memory, then the fault would remain 
unobserved. 
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number of requests needed. Since this could take many cycles, the testing of the two 


Up/Down comparators took approximately 500 test vectors. 


In the previous examples, the circuitry could be tested without adding any addi- 
tional hardware. Sometimes more control over the component is needed, so testing 
hardware must be added. One way to add hardware for testing is by placing multi- 
plexers at the inputs (and/or outputs) of a component. The multiplexers will choose 
between the normal inputs for the component, and special inputs used only for testing. 
A variant of this was used for testing the statistic memory. The ram’s output could 
already be observed through normal control port operations. Similarly it’s address 
and write line could be controlled through the control port. Circuitry was added to 
allow the ram’s data input to get its value from the CDATA input bus instead of from 
the incrementer. The CDATA bus is only 8 bits wide, while the ram is 24 bits wide. 
Four extra inputs were used to give a total of 12 bits of data for the ram. The 24 bit 
ram data input was then formed by using two copies of these 12 bits. This provided 


enough flexibility to completely and easily test the ram. 


It is often advantageous to include testing circuitry when a component is designed. 
The 24 bit incrementer in the statistics section is one example. To test this without 
any additional hardware would have taken an inordinate number of cycles. By de- 
signing the incrementer to allow all carry-inputs to be forced high, testing was made 
much more efficient. Testing of the incrementer was also simplified by making use of 


the ability to load the ram with a value from from the CDATA bus. 


Sometimes it is actually impossible to test for some faults without additional hard- 
ware. This is the case when some logic must work correctly in order to prevent a race 
condition. Consider a register which is occasionally disabled in order to prevent a 
race condition. (This occurs in the select-port section of the input port.) The enable 
signal could be stuck in such a way that the register was always enabled. This fault 
is not easy to detect because under test conditions the race may always be “won” 
by the appropriate signal. But this is no guarantee that the same will be true when 


the chip is in actual use. To test the enable input the output of the register must 
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be made observable. This will make it possible to see when the register is loaded. 
One way to make the signal visible is to make it a chip output, but this could waste 
many outputs. In PaRC a number of hard-to-see signals were wired into a tree of 
EXCLUSIVE-OR gates whose result is sent off chip. Such a tree has the property 
that if a single tree input changes, then the output of the tree will also change. So 
if any one of the inputs to the tree has the wrong value, the output of the tree will 
be inverted, so the fault will be detected. We will have no information about which 
input to the tree caused the fault, but this is not a problem since we have no intention 


to try to debug the internals of a faulty chip. 


Another place where additional hardware must be added is in the scheduler. When 
a request arrives in a scheduler it is written into the scheduler’s s-fifo. If the s-fifo 
is currently empty, the request must be sent directly to the transmitter. The s-fifo 
write is still done, and after a short delay the written data will appear at the output 
of the s-fifo. The data sent to the transmitter is selected by a multiplexer which can 
choose either the data coming out of the s-fifo, or the data going into the s-fifo. Since 
the test vectors run much slower than the chip will operate, when a write is done to 
an empty s-fifo the two inputs to the multiplexer will become the same before the 
multiplexer’s output is used. Yet the test vectors must somehow guarantee that the 
multiplexer is selecting data from the correct input. This is done by adding a chip 
input (nTESTMODE) that can force the scheduler input (but not the mux input) 
to a known state. We can use this to cause the multiplexer inputs to differ while in 
cut-through mode, thus allowing the multiplexer to be tested. This input is used only 


for testing and must be tied high during normal use. 


Lastly, there was part of a component that could not be efficiently tested even if 
a moderate amount of additional hardware was added. This was the clear inputs in 
the schedulers’ 17 bit unary counters. Each of these counters contains 17 registers, 
and only one register in a counter may be high at a time. To test the clear inputs to 
each register we would need to reset the counter while each register had the one high 
bit, and make sure that each reset took place correctly. Since these counters are only 


incremented once per packet, this would have taken several thousand vectors. Instead 
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some extra circuitry was added (just two gates) so that even if a single register’s clear 


input is stuck, the counter will be correctly reset anyway. 


The process of test vector generation for PaRC was time consuming since first the 
entire chip had to be examined to determine what needed to be done to test each gate. 
Then test vectors had to be created that tested all the gates. These test vectors had 
to conform to certain limitations so that the test equipment will be able to perform 
the tests correctly. The test vectors themselves also had to be tested to ensure that 
they would not fail on a correctly fabricated chip. All total, some 13,000 test vectors 
were generated for PaRC. 
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Chapter 5 


Conclusions 


5.1 Future Work 


The design of the PaRC chips described in this document has been completed, and 
working chips have been received. In the future it may be desirable to build another 
version of PaRC to take advantage of improvements in technology. There are several 


modifications to PaRC that any future versions should consider. 


One change would be to implement PaRC in a BiCMOS technology instead of 
CMOS. A BiCMOS chip has a CMOS core of gates, with bipolar gates on the periph- 
ery. The bipolar gates are faster than CMOS, and would be used mostly for input and 
output functions. By using bipolar for receiving and sending data we could output 
data at a higher frequency than PaRC currently can. This might allow us to move 
the functionality of the Data Link Chip onto the PaRC chip. For each PaRC chip 
used this would save the board space used by four DLCs, as well as the TTL/ECL 
level converters that are currently needed between PaRC and the DLCs. Since data 
is being sent out at a higher frequency we would be able to provide a much higher 
bandwidth per port. We would also have the option of keeping the same bandwidth 


per port but either reducing the number of pins or increasing the number of ports. 


A future version of PaRC might include a retry mechanism. A retry mechanism 


!Bipolar technology is not used for the entire array because it has significant power and density 
disadvantages. 
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means that when an error is detected in a packet, the receiver would ask the sender 
to send it again. So instead of merely detecting errors, PaRC would try to correct 
them. The disadvantage of providing retry is that packets must be kept around until 
the sender is sure that the receiver has received it correctly. This takes away buffer 
space that could be used for storing incoming packets. Given a limited amount of 
space for buffering, this would reduce the throughput of the chip. We did consider 
including retry in PaRC. However, the links being designed were expected to be as 
reliable as a backplane; so the gains from having retry did not seem worth the costs. 
Any future versions of PaRC would probably be done in a smaller geometry process 
which would allow more gates per die. This would allow more packet buffering to be 


added, so including retry would not have much of an effect on throughput. 


A final possible change would be incorporating additional routing paradigms. The 
current paradigms work very well when the network can be divided into a small 
number of stages such that a message will only pass through a given stage once. 
In these networks messages tend to pass through only O(logn) switches. Butterfly 
networks and fat trees are two examples of this. Networks where messages make 
many more hops, such as in a mesh, do not fit this structure well. PaRC can only be 
used to implement very small networks of this type. There are a number of ways the 
routing mechanism could be extended to support this. One is to allow each PaRC to 
be given an ID equal in length to the routing data. Each port would look at certain 
bits of a packet’s routing data, compare it to the same bits in the ID, and choose 
the output port based on how they compare. This would give enough flexibility to 


support other networks, such as a multidimensional mesh. 


5.2 Summary 


Even without these changes, PaRC is a very powerful building block for constructing 
networks. We designed PaRC to be able to be used in a large variety of high perfor- 


mance networks. There are a number of features of PaRC that make this possible. 
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PaRC can be used to implement networks with both high throughput and low la- 
tency. Each port of PaRC provides 800 Mbits/second of bandwidth, and much of 
that bandwidth is usable. Also, when a packet comes into PaRC and its output port 
is available, the latency of that packet will be at most 5 > cycles (110 ns). 


A number of aspects of the design contribute to achieving this performance. In 
particular, the design of the packet buffers and scheduler help tremendously. The 
packet buffers do not constrain packets which arrived via the same port to be sent 
out in the same order in which they arrived. In addition, the packet buffers allow 
many buffers from the same port to be read simultaneously. These properties greatly 
reduce the amount of blocking that occurs since an output port will never be blocked 
because another output port is using resources that it needs. In order to effectively use 
the large number of independent packet buffers, the scheduler uses a fifo of requests. 
This provides a very fast scheduler; the scheduler can process a request on the same 
cycle the request is made. The buffers and schedulers allow a packet to be read out 
before the entire packet has been received. This virtual cut through is essential in 
minimizing the latency of packets. Using scheduler fifos also provides a very equitable 


scheduling strategy, coming very close to a first-in-first-out ordering. 


The interface also helps to make effective use of the bandwidth. The interface does 
not need a dedicated line to indicate when packets begin. Given a fixed interconnec- 
tion bandwidth, this increases the bandwidth available for data. Also the interface 
protocol does not require idle time between packets; so if packets are available, they 
will be sent out continuously. Of course engineering PaRC to run at a high speed, 50 


MHz, helps by giving each port a high raw bandwidth. 


PaRC provides a great deal of flexibility for the interconnection networks built out 
of it. Much of this flexibility comes from its routing mechanisms. Many types of 
networks can be built out of PaRC. Networks with up to 2!° distinctly addressed 
nodes are supported. The basic mechanism allows routing to be based on any two 
bits of a packet’s routing information. Routing can also be influenced by which 


port the packet used to enter the chip. For networks which have several paths to a 
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destination, PaRC allows routing decisions to be based upon congestion. This can 
help to even out the flow of messages. The butterfly and the fat tree are just two 


examples of the types of networks which can be built out of PaRC. 


Interconnection flexibility is also provided in ways other than routing. PaRC uses 
an asynchronous clocking scheme, wherein it assumes that each input port operates 
and receives data asynchronously from the rest of the chip. This allows PaRC to be 
used in a machine regardless of whether a synchronous clock is available. This, along 
with PaRC’s flow control mechanism, allows the use of a wide variety of link lengths. 
Lastly, PaRC provides a statistics facility so that we are able to determine how well 


a network is performing. 


PaRC also provides a high degree of reliability. PaRC includes 32 bits of CRC 
error checking in every packet. This is used to ensure that messages have not been 
corrupted. PaRC also checks for errors when packets are not being received. This 
helps to prevent packets from being unknowingly lost if a start of packet indicator 
is missed. Checking for errors when packets are not being received also helps by 
detecting problems as soon as they occur. This may allow us to stop the use of a bad 


connection before any data has been lost. 


PaRC is also able to provide a fast acknowledgment that a packet has been received. 
By using circuit switched packets when an acknowledgment is needed, we can provide 
an acknowledgment more quickly than by other methods. (When an acknowledgment 
is being sent back to the sender, it takes under 15ns to pass through a PaRC chip.) 
This method also reduces traffic since no extra packets need to be sent to provide the 


acknowledgment. 


The PaRC chips described in this report have been fabricated by LSI Logic, and 
have successfully run the test vectors. Several Monsoon processor boards have been 
built, each incorporating a single PaRC chip which operates at full speed for receiving 
and sending tokens. PaRC has been tested in these boards and so far no problems 
have been found. Early in 1991 4x4 switch boards built using a PaRC chip will be 


available. These will allow larger versions of Monsoon to be built. 
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Appendix A 


PaRC User’s Guide 


A.1 Introduction 


PaRC is a 4 by 4 packet routing switch which has been designed and fabricated as a 
CMOS gate array. PaRC receives packets via one of its four input ports, stores the 
packet in an on-chip buffer, and eventually transmits the packet via one of its four 
output ports. Each input port operates asynchronously and has enough buffering to 
store four packets. PaRC can operate at speeds up to 50 MHz, giving each port a raw 
bandwidth of up to 800 Mbits per second. The buffering and scheduling algorithms 
implemented in PaRC allow PaRC to make effective use of this bandwidth, while 
providing a low latency (as low as 54 cycles). The routing mechanism in PaRC allows 
it to be used in a wide variety of networks. In addition, PaRC provides a mechanism 
whereby a processor can quickly receive an acknowledgment when a packet it sent 
has been received. This user’s guide will first give an overview of the chip, and then 
describe the details of the function of the chip. It will not describe the details of the 
implementation, except where such details are necessary to understand the behavior 
of the chip. Also, this document will only give a sketch of why PaRC functions as it 
does. For more information on this, and on the implementation, see the body of this 


document. 


These are the components which make up PaRC, along with their primary function: 


e 4 input ports: receive packets 
e 16 packet buffers (4 per input port): store 1 packet per buffer 


e 4 output ports: each made up of: 
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— transmitter: transmit packets 


— scheduler: determine which packet should be sent next 


e 1 control port: used to control the operation of PaRC 


The input port is the section of PaRC that receives incoming packets. Each input 
port has 4 packet buffers associated with it. When the input port detects the begin- 
ning of an incoming packet, it chooses an empty buffer and writes the packet into 
that buffer. As it does this it checks the packet for errors by the use of a CRC code. 
It must also determine which output port the packet should go to, and inform that 
output port’s scheduler that a packet is waiting and where it is being stored. The 
input port must also generate a flow control signal, so that it does not receive more 


packets than it has room for. 


Each packet buffer can store exactly one packet. For performance reasons a buffer 
must have several characteristics. It must be able to begin reading out a packet while 
that packet is still being written into the buffer. This helps decrease the latency of 
packets. Also, it is important that each buffer can be read by any output port at 
any time, regardless of which other buffers are currently being read. This will greatly 


decrease blocking in PaRC and thus improve both throughput and latency. 


Each output port consists of a scheduler and a transmitter. The scheduler keeps 
track of all the packets which wish to use the output port and chooses which packet 
will be transmitted next. If two or more of the waiting packets were received via 
the same input port, the scheduler guarantees that they are transmitted in the same 
order in which they arrived. When the transmitter is ready to send a packet, it first 
finds out from the scheduler where the next packet is being stored. The transmitter 
then reads the packet out of a packet buffer and transmits it to the next stage of the 
network. If the packet is one which requires an acknowledgment (such packets are 
called circuit switched packets), the transmitter must also keep track of the packet’s 
input port and begin to look for an acknowledgment that the packet has reached its 
destination. When this acknowledgment arrives, the transmitter must pass back an 


acknowledgment through the packet’s input port. 


The control section has two main functions. One function is to keep statistics on 
the performance of the network. The other is to act as an interface to the network 
control system. This interface can be used to control and monitor the performance 
of PaRC. In particular, the control section contains a series of registers each bit of 


which controls different aspects of PaRC. These bits will be referred to throughout 
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this guide, however the details of the reading and writing of these registers will be 


described only in the section on the control port. 


A.2 Input Port 


A.2.1 Input Data Bus 


An input port is responsible for receiving packets into PaRC. It receives packets over 
a 16 bit data bus called, for input port «, DIN«.[15..0]. It also receives a clock, called 
XCLKa, which is used to clock in the data on DIN. The rising edge of this clock 
must occur while the data is stable. The input port will be operate based on this 
incoming clock. Each port’s XCLK is assumed to be asynchronous from the other 
XCLKs and from ICLK, PaRC’s main clock. However, each XCLK must have a 
frequency close to ICLK’s (within 10%). 


A PaRC packet is 192 bits long, made up of 12 16-bit words. The first 16 bit word 
is a header. The header contains all the information needed for routing the packet. 
The next 9 words are data and the final 2 words normally contain the CRC value 
which is used for error checking. Packets cannot be interrupted or canceled. Therefore 
once the first word of a packet is received, the next 11 words will be received on the 


following cycles. 


WORD USE 
Header 


WORD USE 
Data word 5 
7 Data word 6 


/WORD 

po 

Data word 0 
Data word 1 
Data word 8 
10 CRC word 0 
11 CRC word 1 


Data word 2 
Data word 3 


| WORD | 
pe 
a 
8 | Data word 7 
9 
| 
pou 


Data word 4 


PaRC Packet Format 


During each cycle the input port will receive a word on its DIN bus, and it must 


decide whether or not this word is part of a packet. This is done by using bit 15 as a 
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start-of-packet (SOP) bit. If the input port is not currently receiving a packet, and if 
the new word has 0 in its start-of-packet bit, then this word is not part of a packet; 


it is an idle word. 


When a word arrives where this bit is 1, that word is taken to be the first word of 
a packet. Since packets cannot be interrupted, the next 11 words must be the rest of 
the packet, and can use bit 15 as a data bit. Once the packet ends, the input port 
begins to look at the start-of-packet bit again. There is no need for an idle word 
between packets; a new packet may begin immediately after the previous one has 


completed. 


When an input port is not receiving a packet, it should be receiving one of two idle 
patterns: x5555 and x2AAA. (Note that bit 15 of each of these words is a 0.) If the 
input port receives a word that is not the start of a new packet (i.e., bit 15 is a 0.), 
but is not an idle pattern either, then an idle error (IDLERRz) occurs. The control 
port is informed that this error has occurred, and the input port continues operating, 
treating the word as an idle word. Checking for idle patterns can be disabled by 
setting the CHK-IDLE bit [bit 4 of the STATUS register] in the control port to 0." 


Idle patterns are used to allow the input port to catch almost all transmission 


errors. Here are the possible transmission errors, and how they will be detected: 


e Error occurs inside of packet (i.e., any bit except SOP bit). This will cause a 
CRC error. 


Error turns SOP bit from 1 to 0. This will cause an idle error. 


e Error occurs in bits 0-14 of idle pattern. This will give an idle error. 


e Error occurs in bit 15 of idle pattern (i.e., turns SOP bit from 0 to 1.) This 


will create a packet which will cause a CRC error. 


This makes it very unlikely that an undetected error will occur. And when errors 
do occur, it will be easy to pinpoint their location since we can tell where the error 
first appeared. 

In addition to the start-of-packet bit, the header also has two other uses. Bits 
13..0 are used as a routing address to determine which port the packet will be sent 


'Only one idlerr is asserted for each burst of bad idle words. This may be more meaningful than 
counting each individual error, especially since the idle-error counter only counts up to three. 
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to. This provides 2!* distinct routing addresses. The section on routing describes 
how these bits are used. Bit 14 is used as a “circuit switch” indicator. If this bit 
is 1 then then the packet is treated as a circuit switched packet (CSP). Optionally, 
the circuit switch feature can be turned off by setting the ALLOW-CSW bit [bit 0 of 
the STATUS2 register] in the control port to 0; when this is done no packets will be 
treated as circuit switched packets. Bit 14 can also be used as an additional bit of 


the routing address. Usually this is only done when ALLOW-CSW has been set to 0. 


Header Format 
me 


CSP/ROUTE-DATA.14 | ROUTE-DATA.[13..0] 


Also, it is possible to “turn off” an input port. This should be done, for example, 


before a link talking to an input port is replaced or resynchronized. Each input port 
has a nSTOP bit in the control section. When this bit is 1, packets are received 
normally. When it is set to 0, the packet currently being received (if any) will be 
accepted normally, but no more packets will be accepted until the bit is returned to 
1. The nSTOP bits are found in the nSTOP register of the control section. Bit x of 
nSTOP.[3..0] is the stop bit for input port z. 


Each input port only has enough buffers to store four packets. The flow control 
mechanism should prevent packets from arriving when there are no buffers for them 
(see Section A.2.5). If a packet arrives when there is no room for it the control port 
is notified that a room error (ROOMERRz) occurred. The input port will read in 
the rest of the packet normally and then simply discard it. 


A.2.2 CRC 


Error checking is done on packets by use of a Cyclic Redundancy Code (CRC). PaRC 
uses a 32 bit checksum and accumulates values using the CRC32 polynomial.” As they 
are received, the header and the 9 data words are accumulated to give a checksum. 
This checksum is the 32 bits that should appear in the final two words of the packet: 
CRCO and CRC1. CRCO[15..0] are bits [31..16] of the checksum. CRC1[15..0] are 
bits [15..0] of the checksum. 


?The polynomial used is #3? + a?9 4+ a?! + alt + a? +1. 
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With CRC codes, it is possible to prove that certain classes of errors will always be 
detected. When checking a packet for errors, the CRC32 code used will always detect 
an error when there are only 1, 2, or 3 erroneous bits in a packet, when there are 
an odd number of errors in a packet, or when all erroneous bits occur within a span 
of 32 bits. This last property means that if all errors in a packet occur within two 
consecutive PaRC data words, errors will always be correctly detected. For errors 
which span more than 32 bits, the probability of an error going undetected is less 
than 1 in 10°. Additionally, when errors occur on only one of a port’s 16 data input 


lines, this code will always detect the errors. 


When a packet arrives, the input port will check to see if the CRC is correct. If 
input port x detects an error then a CRC error (CRCERROR zx) occurs. When this 
happens the input port notifies the control section, but continues operating normally. 
CRC checking can be shut off by setting the CHECK-CRC bit [bit 1 of the STATUS1 
register] in the control port to 0. PaRC can also be told to generate the CRC; this is 
done by setting the GEN-CRC bit [bit 0 of the STATUS2 register] in the control port 
to 1. When this is done the input port will ignore the last two words of incoming 
packets and replace them with the correct CRC words. Note again that the two 
words which would otherwise have been the CRC are ignored; an input port will not 
begin to look for a new packet until after these two dummy words have been received. 
Also, the value of the CRC control bits are used by a packet only as it arrives. If, for 
example, GEN-CRC is 0 while the packet is being received, but later changes to 1, a 
CRC will not be generated for the packet. 


PaRC may also be configured to use only one 16 bit CRC word. This is done by 
setting the CRC2 bit [bit 2 of the STATUS register] to 0. If this is done the first 11 
words of the packet are used for generating the checksum. The next word is expected 
to be bits [31..15] of the generated checksum. PaRC can check or generate CRCs in 
this mode, based on the GEN-CRC and CHECK-CRC bits described above. This is 
available to provide some error checking in the event that more data needs to be put 
into each PaRC packet. Use of the CRC1 and CHECK-CRC bits allows PaRC to be 
configured to have packets with 9, 10 or 11 words of data (with 2, 1, or 0 words of 
CRC, respectively). 


Normally only zero or one of GEN-CRC and CHECK-CRC need be set. If both are 
set PaRC will generate a CRC and check this generated CRC. An error (CRCERR) 


would be found only if there were a serious problem (e.g., setup/hold time violation 


3 Additionally, bits in the header which are not used for routing could be used as data bits. 
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or a defective chip). 


A.2.3 Routing 


When a packet arrives, the input port must decide which of the 4 output ports should 
transmit the packet. This is where the 14 (or 15) ROUTE-DATA bits from the header 
are used. Each PaRC chip can be made to look at different bits of ROUTE-DATA. 
This allows networks of size 2'4 = 16K to be supported (32K if no circuit switched 


packets are needed). 


The control section of PaRC contains an 8 bit register called RINF. This controls 
which bits of the header are used in determining the destination of a packet. The 
destination is a two bit number which designates one of the 4 output ports (numbered 
0 to 3). Each bit of this destination is determined separately. The 4 higher order bits 
of RINF control the selection of the higher order bit of the destination. The 4 lower 
bits of RINF control how the lower bit of the destination is selected. 


Normally each of the 4 bit values within RINF designates one location in the 
header, and the bits from the designated locations are concatenated to form the 
destination. If RINF[7:4J=x and RINF[3:0]=y then the bit 1 of the destination is 
ROUTE-DATA.z and bit 0 is ROUTE-DATA.y. For example: if RINF = x10, then 


the lower two bits of the header are used as the routing address. 


There are 2 RINF values that are treated differently: E and F. A RINF value of F 
does not select bit 15 of the header (it is always 1 anyway), but instead selects a bit 
from the input port’s location. F in the upper part of RINF means use the upper bit 
of the input port’s location, while F in the lower bit means use the lower bit of that 
port’s location. This allows routing to be influenced by which port a packet arrived 
in. For example, in port 2 (= b10) an F in the upper nibble of RINF means the upper 
bit of the destination is always 1. An F in the lower nibble means the lower bit of 


the destination is always 0. 


The value E is only treated specially when it appears in the lower part of RINF. 
E means that “UP/DOWN” routing is being done. In this mode, the upper bit of 
the destination is chosen first. This narrows down the possible output ports to two. 
The destination is then set to the port (among the two possibilities) with the shortest 
request queue. (Actually, since there is some delay, it is sent to the port which had 


the shortest queue 2 or 3 cycles ago). 
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When the value E appears in the upper part of RINF, it is treated normally; 7.e., 
it causes ROUTE-DATA.14 to be used as the upper bit of the destination. When this 
is done Circuit Switching would usually be disabled, as bit 14 is also used to indicate 


a circuit switched packet. 


Also, the value of RINF should not be changed while packets are being received, 
as this may cause unpredictable results. To change this value while the network is 
in use, you must first guarantee that no packets are being sent to the input ports. 
This may be done by asserting all of the WAIT signals, or by temporarily stopping 
the output ports that are sending packets to this port. [See below for information on 


how to do these.] 


The following table summarize this information, and the next table gives some 


examples. 


ROUTING INFO 
8 bits (X,Y): 4 bits each for upper and lower bits 
X == get upper routing bit from Xth bit of header. 
Y == get lower routing bit from Yth bit of header. 
Exceptions: 
X == xF: set upper bit to be the upper bit of that port’s location. 
Y == xF: set lower bit to be the lower bit of that port’s location. 
Y == xE: Choose between ports 0 and | (if upper bit is 0; else 2 and 3) 


based on which output port has fewer packets waiting to use it. 
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RINF Examples 
RINF | Function (where X,Y are not one of the special values) 


po | 

XY | 

XX | 

= 

Acts as two parallel 2x2 routers: [= 2*(2x2) routing] 

Packets coming in on upper ports(3,2) go out one of 
the upper ports. The value of bit Y determines which one. 
(Likewise for lower ports). 


Does up/down routing 
Based on bit X choose between upper or lower ports. 
Then choose the output port with the shortest queue. 
Acts as two parallel 2x2 “equalizers:” 
Packets from upper ports sent to the upper port with shortest queue. 


Packets from lower ports sent to the lower port with shortest queue. 


A.2.4 Buffer Use 


There are four packet buffers associated with each input port. When PaRC is reset 
each of these buffers is marked as empty. When the input port begins to write a 
packet into one of these buffers, it is marked as full. When that packet is mostly read 


out, it is marked as empty, and a new packet can then be stored in that location. 


The state of these buffers may be determined by reading locations C and D in the 
control port. These locations are called BIU20 and BIU31. BIU20 gives the Buffer- 
In-Use information for the buffers in ports 2 and 0. Each bit corresponds to one of 
the buffers and is high iff that buffer is in use. 


BIU20 


7 6 5 4 3 2 1 0 
BIU2.3 | BIU2.2 | BIU2.1 | BIU2.0 | BIU0.3 | BIUO.2 | BIUO.1 | BIUO.0 
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BIU31 


7 6 5 4 3 2 1 0 
BIU3.3 | BIU3.2 | BIU3.1 | BIU3.0 | BIU1.3 | BIU1.2 | BIUL.1 | BIU1.0 


where, 
BIUX. Y = BUFFER-IN-USE bit for buffer Y of port X, and is high iff that buffer 


is in use. 


A.2.5 Flow Control 


Flow control is needed on a connection (also called “link”) to prevent packets from 
arriving when there are no buffers available to hold them. In PaRC, input port z 
produces the signal WAITz which must be sent “backwards” from each input port 
to the transmitter which sends data to that input port. When an input port brings 
its WAIT signal high, the transmitter which is sending packets to it should not send 
any new packets until the cycle after WAIT goes back low. 


There are 2 modes for determining WAIT: Normal mode, and LWAIT mode. The 
normal way that WAIT is produced is that it is high only when all four of an input 
port’s buffers are in use. This is the ideal signal since it will make optimal use of an 
input port’s buffers. However this mode will work correctly only if the connection 
between the transmitter and the input port is not too long. If the transit time between 
transmitter and input port is too long, the WAIT signal will not be received in time 


to stop the next packet from being sent. 


In order for WAIT to go high in time to stop the next packet from being sent, the 


following must be true.: 


T stop—next—pachet > Taata—transit + T produce—wait + T wait—transit 


where, 


T stop—next—packet = Lime from when word 0 is sent (i.e., when DOUT.16, the clock 
generated by the output port, rises) to when the received WAIT must go high 
in order to guarantee that the next packet will not be sent. (In PaRC this is 9 
cycles. ) 
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Tiata—transit = Time from when word 0 is sent to when it is received at the input 


port. 


Tyroduce-wait = Time from when word 0 is received by the input port to WAIT goes 
high. (15 ns) 


Twait—transit = Time from when WAIT is sent by the input port to when it arrives 


at the output port.* 


From this we calculate that when 2 PaRCs are communicating, normal mode can 
only be used if the transit time from the transmitter’s DOUT to the input port’s DIN 
plus the transit time from the input port’s WAIT to the receiver’s INFI pin is less 
than 8 cycles. This mode should be used wherever possible since it will provide more 


efficient utilization of buffers than will LWAIT. 


For longer connections, we need WAIT to go high sooner. This is done by entering 
LWAIT mode by setting the USE-LW bit (bit 2 of the STATUS2 register) in the 
control section. WAIT will now be high whenever there are only 0 or 1 buffers 
available in input port «. So when an input port raises WAIT, it can still safely 
receive one more packet. This allows us to add over 11 cycles to the Titop—next—packet 


time given above. This will allow links much longer than we expect to use. 


The disadvantage of using LWAIT is that it allows available buffers to remain 


unused. This can happen at two times: 


e When the number of available buffers drops from 2 to 1. WAIT may go high in 


time to stop the next packet, thus leaving one of the buffers unused.” 


e When the number of available buffers rises from 0 to 1. There is now an available 


buffer, but it will not be used until a second buffer becomes available. 


The generation of WAIT in LWAIT mode can be modified to attempt to avoid 
these situations by setting FANCY-LW (bit 3 of the STATUSI register) to 1 (while 
keeping USE-LW at 1). When this is done the generation of WAIT is affected in the 


following ways: 


*Tiata—transit Taay not be the same as Twait—transit. If a DLC (Data Link Chip) is used between 
two PaRCs then Tyata—transit Will be longer since the data must be latched in by the DLC chip and 
serialized, while the WAIT signal can be passed on untouched. 

>This will tend to happen on links which are just long enough to require the use of LWAIT. 
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e When the number of available buffers drops from 2 to 1, the rising edge of WAIT 
will be delayed for 2 cycles. 


e When the number of available buffers rises from 0 to 1, WAIT will go low for 


approximately 2 cycles. 


Both of these will allow the transmitter to send a packet (if it has one) into the 
fourth, previously unused, buffer. When WAIT first goes high, the first change should 
delay the rise of WAIT long enough to allow the transmitter time to send a packet 
into the fourth buffer.° With the second change, when all buffers are full and a packet 
leaves, WAIT will go low for a short time, thereby allowing a new packet to be sent 
to take its place. For both of these modifications, there is only a small window of 
time when a packet will be able to be sent into the only empty buffer. If no packet 
needs to be sent during that time, then the buffer will remain empty. So the fourth 
buffer will be most effectively used during times of heavier traffic, and that is exactly 


when it is needed most. 


The FANCY-LW bit has no effect when LWAIT is 0. It is expected that when 
using LWAIT, FANCY-LW will always be used. Only in the very rare cases when 


there are extremely long links would it need to be turned off. 


Occasionally it is useful to force the input ports to assert WAIT. For example, if 
we wanted to turn off the input ports of a PaRC we should first assert all WAITs so 
that the transmitters talking to those input ports will stop sending packets. When 
the WAIT-ALL bit in the control port [bit 3 of the STATUS2 register] is set to 
1, all WAITs will be forced high. Also, all WAITs may be forced high if the chip 
enters “Passive Mode” after encountering an error. This will be described in the 


documentation on the control section. 


A.3 Output Port 


Output ports are made up of 2 components: a scheduler and a transmitter. The 
scheduler is responsible for determining which packet should be sent next. The trans- 


mitter is responsible for transmitting that packet. 


5This is done at the expense of removing two cycles from the allowable round-trip link transit 
time. 
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A.3.1 Scheduler 


The job of the scheduler is to tell the transmitter which packet is to be transmitted 
next. It does this by maintaining a FIFO queue of packets which want to use its 
output port. When the transmitter wants to send another packet, the scheduler will 
tell it which buffer the packet is stored in and whether or not that packet is a circuit 
switched packet. Since the queue is FIFO, packets going to the same port will be sent 
off the chip in the same order in which they arrived, even if they arrived via different 
input ports. However, packets which arrived on different ports within 2-3 cycles of 


each other may be transmitted in any order. 


Each scheduler keeps a running count of how many packets are in its queue. The 
lower four bits of these counters may be examined by reading locations E and F in the 
control port. These locations are called 5120 and $131. $120 gives this information 
for the schedulers in ports 2 and 0; $131 for ports 3 and 1. 


S120 


912.3 | $12.2 | S12.1 | $12.0 | 510.3 | S10.2 | S10.1 | S10.0 


7 6 5 4 3 2 1 0 
913.3 | 513.2 | S13.1 | $13.0 | S11.3 | SI1.2 | STL | SI1.0 


where, 


SIx.[3..0] are the lower 4 bits of the scheduler counter for output port x. 


This count is also used in UP/DOWN routing. On each cycle the count for output 
port 3 is compared to that of output port 2. If output port 2 has fewer waiting 
packets, then packets that want to go to an upper port will be sent to output port 
2; else they will be sent to output port 3. A similar comparison is done for output 


ports 1 and 0. 
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A.3.2 Transmitter 


The transmitter of output port x sends out its data over a 16 bit bus which is called 
DOUT«.[15..0]. It also generates an output clock called DOUT2.16. This clock is 
generated from the falling edge of ICLK and rises when the data on the DOUT bus 


is stable. 


The format of packets on this bus is identical to the packet description given in 
the discussion of the input port. Whenever the transmitter is not sending a packet, it 
will generate an idle pattern. Normally the idle pattern will alternate between x5555 
and x2AAA. But if the TOG-IDLE bit [bit 5 of the STATUS1 register] is set to 0, 
only the idle pattern x5555 will be used. 


Output port x receives the flow control signal INFIz.1 (also called WAITIN«). 
When this signal goes high the transmitter will not begin to send any new packets. 
(Any packets already being sent will continue normally.) Since PaRC must synchro- 
nize this signal, a packet may still start during the 2 to 3 cycles after it is brought 
high. Similarly, no packets will be begun during the 2 to 3 cycles after this signal 


goes low. 


Circuit Switched Packets 


If the sender of a packet wants to be informed when the packet has arrived, the 
sender can designate the packet a circuit switched packet. Circuit switched packets 
are sent through the network in the same manner as normal packets. The difference 
is that as a circuit switched packet passes through a switch, the transmitter keeps 
a “back pointer” to where the packet came from. These back pointers can then be 
used to send an acknowledgment to the packet’s sender. When a transmitter sends 
a circuit switched packet, it enters circuit switch mode (CSmode) and remembers 
which input port the packet arrived through. It then begins to watch its INFI0.x 
input (also called RELIN.«) for an acknowledgment signal, which is indicated by an 
active high pulse. This input is driven by the RELOUT signal of the input port 
this transmitter is sending to. When this acknowledgment arrives, the transmitter 
exits CSmode and sends out an acknowledgment signal on the RELOUT line of the 
remembered port. (This acknowledgment is also a single high pulse.) While waiting 
for an acknowledgment, a transmitter can continue to send out normal packets. But if 


a second circuit switched packet needs to be sent out during this time, the transmitter 
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will generate idles until the first circuit switched packet’s acknowledgment is received. 


The next circuit switched packet can then be sent. 


The transmitter can be set up to generate an acknowledgment, instead of waiting 
for one. This is done by setting the GEN-ACK bit [bit 1 of the STATUS2 register] to 
1. When this is done, the transmitter will generate the appropriate acknowledgment 
as soon as it starts to transmit a circuit switched packet.’ Note that the ALLOW- 
CSW bit is not used in the output port. If a packet is determined to be a circuit 
switched packet when it arrives, it will be treated as such, even if ALLOW-CSW is 


later set to 0. 


Other Details 


On each cycle each transmitter characterizes the use of its output link during that 
cycle. A link is considered to be in one of four states: 
Idle: No packets being sent, or waiting to be sent. 
Xmit: A packet is being sent. 
CSWait: No packet is being sent because we are waiting for an 
acknowledgment, and a circuit switched packet is to be sent next. 
Wait: No packet is being sent, but one would be sent if WAIT were low. 
This information is called XINF and is used by the control port for keeping statis- 


tics. These values can be obtained by reading location 7 of the control port. This 


location gives the following values: 


XINF 


7..6 5.4 3..2 1..0 
XINF3.[1..0] | XINF2.[1..0] | XINF1.[1..0] | XINFO.[1..0] 


where, 
XINFx.[1..0] = the XINF value for output port x. 


“It is expected that this bit will be set for the next-to-last stage of the network, and the final 
stage will have ALLOW-CSW set to 0. This will cause the acknowledgment to be generated when 
the packet leaves the next-to-last stage of the network. The acknowledgment can be generated at 
that time because once a packet is sent to the last stage of the network, no new packets will be able 
to get in front of it. 
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Encoding of XINF 


00 | WAIT 
01 CSWAIT 
10 IDLE 
ll XMIT 


Also, it is possible to turn off an output port. When this is done, the transmitter 


will complete any packet it was in the process of sending, but will not start to send 
any more packets. Packets can still be routed to this port, but the packets will not 
be transmitted until the port is turned back on. To turn off an output port, its bit in 
the nSTOP register [location 1 in the control section] must be set to 0. The nSTOP 
bit for output port x is bit (a + 4) of the nSTOP register. 


A.3.3 ICLK 


ICLK is the clock which PaRC uses as its main clock. ICLK may be running at 
any frequency up to 50MHz. ICLK is assumed to be asynchronous to the incoming 
XCLKs. But the frequencies of each XCLK must be close to that of ICLK. Each 
XCLK is assumed to have a period that differs from ICLK’s period by no more 
than 10%. ICLK is used both as the main internal clock and also to generate the 
DOUT 2.16 clock that is output by each transmitter. 


A.4 Control Section 


The control section performs a number of functions which affect the operation of the 
entire chip. In particular, it implements the control port which allows a network 
control system to control and monitor the performance of PaRC. The control section 
also keeps statistics on how the output ports are being utilized. To the user, the 


control port looks like 64 locations which can be read or written. 
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A.4.1 Reset 


PaRC can be brought into a known state by resetting it. This can be done either by 
a pin reset or by a reset command. A pin reset is done by bringing the nRESET line 
low. A reset command is done by writing any value to location 5 of the control port. 
When PaRC is reset all packets currently stored, being received, or being transmitted 
will be lost. Once the reset ends, each input port will be ready to receive new packets, 
and will have 4 available buffers to store them in. During reset, each output port will 
begin to send an idle pattern. After the reset ends, the output port will be in the idle 
state and will have no packets waiting to use the port. The only values which may 
be retained during a reset are in the control port. If statistics were being gathered 
while the reset occurred, then the statistics values are no longer valid; otherwise these 
values will be retained during a reset. When a command reset is done, the values in 
the writable control registers (locations 0-4), and the error counters will be retained. 


During a pin reset these values will be reset to their default state. 


The nRESET input is asynchronous; when this line is brought low the reset will 
begin to take effect almost immediately. It should be held low for at least 4 cycles. The 
reset can be considered over, and PaRC full ready for use, two cycles after nRESET 
goes back high. When a reset command is given the reset will begin several cycles 
later, and will last for 6 cycles. The reset can be considered complete 12 cycles after 


the command was given. 


A.4.2 Control Locations 


The last 48 locations are used solely for statistics and will be described later. The 
first 16 locations are used for a variety of purposes. Most of these locations have 


already been mentioned. The following table summarizes these locations: 
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PaRC Registers 0 - F 


Location Default Register 
(hex) Value (hex) | Name 
RINF 
nSTOP 
STAT INFO 
STATUS1 
STATUS2 
PaRC Reset 
Error Reset 
XINF/Sync-Link cmd 
CRC Error 
IDLE Error 
ROOM Error 
LINK Error 
BIU20 
BIU31 
5120 
S131 


2 
3 
4 
5 
6 
7 
8 
9 
A 
B 
C 
D 
E 
F 


Daa AFF DF TF 


Where Type means: 
RW = Can read and write this value. 
R = Can only read this value. 
C = “Command Location” Doing a write to this location causes a 
command to be executed. The value written is ignored. 


The default value is the value the register is loaded with when a pin reset is done. 


Errors 


Previous sections have described how CRC, Idle, and Room errors are detected by 


an input port. When one of these errors occurs, the control port increments the 
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appropriate error counter. For each error type there is a two bit error count for each 
of the four ports. When one of the CRC, Idle, or Room error locations are read, the 
four error counts for that error will be output. These three error locations can be 


interpreted as follows: 


ERROR Counters 
bit bit 


no errors 
1 error 


The only other errors that can occur are link errors. Each output port may be 


directly connected to another input port. However, when PaRCs on different boards 
communicate, data-link-chips (DLCs) will often be used. A link transmitter will 
receive an output port’s data, multiplex it onto fewer wires, and transmit the data 
between boards. A link receiver will receive the data, demultiplex it, and present it to 
an input port. Each link transmitter can generate two types of errors called link-error 
and link-gray. (Produced by the link chip outputs TSYNCERR and TGRAYERR.) 
The PaRC chip receives information about these errors on the inputs LINKERR.[3..0] 
and LINKGRAY.[3..0]. The ith bit of each of these buses should come from the link 
chip connected to the zth transmitter. A rising edge on one of these inputs is used to 


indicate an error. 


For each of the link and gray errors, the control port only has a one bit counter. 
(Once a link chip produces an error, it will not signal another error of that type until 
it has been reset.) This bit is 0 if no such error has occurred, and 1 if the error has 


occurred. These bits can be examined by reading the LINK-ERROR location. The 


value produced can be interpreted as follows: 
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Link Error Information 


bit bit 
7 6 5 4 


LinkGray.3 | LinkGray.2 | LinkGray.1 | LinkGray.0 
Link Error Information 


bit bit 
3 2 i 0 


LinkErr.3 LinkErr.2 LinkErr.1 LinkErr.0 


where, 
LinkGray.x = | if there has been a gray error from link chip x. 

0 if there has not been a gray error from link chip z. 
LinkErr.z = 1 if there has been a link error from link chip x. 


0 if there has not been a link error from link chip z. 


A PaRC chip can be made to ignore gray errors. This is done by using the CHKk- 
GRAY bit (bit 4 of the STATUS2 register). When this bit is set to 0, any gray errors 


will be ignored. 


PaRC produces an ERROR output. This bit will go high whenever an error has 
been detected, and it will stay high until the error counters are reset. When ERROR 
goes high, the four error locations should be read to determine what type of error 


occurred. Note that if an error occurs while we are not checking for that type of error 


(e.g., a Gray error while CHK-GRAY is 0), then ERROR will not go high. 


When the control port is informed of an error it may put the chip into “Passive 
Mode.” When in passive mode, output ports will not begin to transmit any more 
packets, and all input ports will assert WAIT.® Whether or not passive mode is entered 
depends upon the value of the PASS-ERR and PASS-GRAY bits (bits 6 and 5 of the 
STATUS2 register). If a Gray error has occurred and both PASS-ERR and PASS- 
GRAY are 1, then the chip will be put into passive mode. If any other type of error has 


’Passive mode is entered in an attempt to minimize the data lost. For some errors it may prevent 
any data from being lost at all. 
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occurred and PASS-ERR is high then the chip will go into passive mode.’ The chip 
will remain in passive mode until either the error counters are reset, or PASS-ERR 


and PASS-GRAY are changed such that passive mode is no longer indicated. 


Command Locations 


There are three locations which are used for issuing commands to PaRC. These loca- 
tions are called command locations. When a write is done to one of these locations, 
PaRC will perform a certain command. For command locations, the command is 
executed whenever the location is written to; the value written is ignored. The first 
of these locations is the PaRC reset command (location 5), and has already been 
described. The second is the error reset command (location 6). When this command 
is given all the error counters will be reset. The final one is the sync-link command at 
location 7. Writing this location will cause the synclink output to go high for 6 cycles. 
This output will be used to resynchronize the DLC transmitters that are receiving 
data from PaRC. Note that location 7 is also used to read the XINF value. When 
a read is done the XINF value will be returned, when a write is done the synclink 


command is given. Writing this location will have no effect on the XINF values. 


Status Registers 


The bits of the Status registers (locations 3, 4) have been described throughout this 


document. The following tables summarize the use of these registers. 


°Gray errors indicate a non-fatal errors in the link chip, which means that the data being trans- 
mitted is still correct. By treating Gray errors specially, we have the flexibility to continue operating 
after a Gray error, but go passive on all other errors. 
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Statusl register 
ee Default meaning if bit is high 


Gen-CRC Generate CRC for incoming packets. 


Use-CRC Check CRCs in incoming packets. 


CRC2 Use 2 word CRCs. 
Fancy-LW | Use Fancy LWAIT modifications to LWAIT. 
Chk-Idle Check incoming idle patterns. 


Tog-Idle Toggle output idle patterns. 


Status2 register 
ee Default meaning if bit is high 


Allow-CSW | Allow Circuit Switched packets. 
Gen-Ack Generate an ack when a CSP is transmitted. 


Use-LW Generate WAIT using the LWAIT definition. 


Wait-All Assert all WAIT outputs. 
Chk-Gray Check for Gray errors from the links. 


Pass-Gray | Allow passive mode to be entered on Gray errors. 


Pass-Err Enter passive mode when an error is detected. 


A.4.3 Statistics 


The statistics section is used to obtain information on how the output ports (and 
hence their associated links) are being utilized. For each output port we keep track 
of four statistics, based on the current state of the output port. The value of the 
XINF signal is used to characterize the state of the output port. As described ear- 
lier, these four states are: transmitting (XMIT), idling (IDLE), waiting for a circuit 
switch acknowledgment (CSWAIT), and waiting for the WAIT signal to be deasserted 
(WAIT). The statistic values will give us an indication of how much time the output 


ports spend in each state. 


The statistics section compiles these statistics, and stores them in a 16x24 ram. 


They can be read out through the control port. The STAT-INFO register is used to 
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control the statistics generation. STAT-INFO is a 4 bit register. Two of these bits 


are only used for testing and must be set to a specified value!®: 


STAT INFO 


3 0 


The statistics section can be in one of two modes. When the Stat-Mode bit is 0, 
it is in Read-and-Clear mode. During this mode the statistic values can be read or 
they can be reset to 0. Since statistics are 24 bit values, and the control port bus is 
only 8 bits wide, it takes 3 reads to receive an entire statistics value. When reading a 
statistics value the upper two bits of the address determine whether the lower, middle 
or upper byte of the statistic is read. The next two bits determine which port the 
statistic will apply to. The last two bits determine which of that port’s statistics is 


read. 


Each location can be identified as SCpsb. The p identifies the port number (0-3), 
s identifies the state (X,I,C,W), and 6 identifies the byte (L,M,U). The Lower byte 
is bits [7..0] of the value, the Middle is bits [15..8], and the upper is [23..16]. The 


following table shows how each byte may be accessed. 


l0Bit 2 is used for testing the incrementer. If it is high, the incrementer will increment each nibble 
of its input. If Bit 1 is high, the statistics section will stay in Read-and-Clear mode. However, instead 
of clearing locations, writes to the ram will store either the value coming from the incrementer (if 
bit 0 is high), or a value obtained from the control port’s data bus (if bit 0 is low). 
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Location | Value 
30 SCOXU 
31 SCOIU 
32 
33 SCOWU 
34 
35 SCIIU 
36 
37 
38 
39 SC21U 
3A 
3B 
3C 
3D 
3E 
3F 


Location | Value 

10 SCOXL 20 SCOXM 
11 SCOIL 21 SCOIM 
12 | SCOCL 
13 SCOWL 23 SCOWM 
14 24 SCIXM 
15 25 
16 26 SCICM 
17 27 SCIWM 
18 28 SC2XM 
19 29 
1A 2A 
1B 2B 
1C SC3XL 2C SC3XM 
1D SC3IL 2D SC3IM 


1E SC3CL 2h 5C3CM 


1F SC3WL 2h SsC3WM 


Location | Value 


SCLIL 


SsCLWU 


SC2IL 
5C2CM 
SC2WL SsC2WM SsC2WU 
SC3XU 
sC31U 

SC3CU 


SsC3WU 


ep ep 
Q OQ 
N re 
—- — 
= = 


[Location | Value_| 
| 10 | Sox | 
| | secon | 
| | scoct. | 
| 13 | scows | 
| | Scaxt | 
pts | scat | 
| 16 [scict, | 
| ar | sca we | 
| 18 | Sax | 
| 19 | scart | 
PA [scact, | 
| 1B | scawe | 
| ie | scaxt | 
| iD [scsit_ | 
| IB | scsct | 
| IP | scswe | 


[Location | Value 
| 30 | ScoxU J 
| at | scoru_| 
| 2 | scocu | 
| 88 | scowt 
| 4 | Sax | 
| 35 | scuu_ 
| 36 | scicu_ 
| 37 | scuwu | 
| 88 | 8caxU | 
| 39 | scare] 
| 3A | sc2cu_ 
| 3B | scawu | 
| 3c | scaxu | 
| ab | scat 
| 3B | scscu | 
| aP | scawu | 


Statistic Locations 


When the statistics section is in Read-and-Clear mode, doing a write to any statistic 


location will reset all 24 bits of that statistic to 0. (The value written is ignored.) 


When the Stat-Mode bit is set to 1, the statistics section enters AUTO mode, and 
begins to compile statistics. The statistics section determines the state of an output 
port by looking at XINF. It then reads that state’s statistic value from statistic 
memory, increments it, and writes the new value back into memory. This process 
takes three cycles, after which it is repeated for the next output port. Since there 
are 4 output ports, each one is sampled every 12 cycles. Since the packet length is 
also 12 cycles, each port’s XMIT statistic gives a count of the number of packets that 
were transmitted by that port. 


In normal use the statistics will be put into Read-and-Clear mode and 16 writes 


will be done to reset all locations. Then the statistics section will be put into AUTO 
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mode to collect statistics. It will later be returned to Read-and-Clear mode so that 
all statistics can be read. Values can then be reset to 0, and statistic collection can 


be begin again. 


When collecting statistics, the statistic values must be sampled and reset regularly, 
or they may overflow. Since values are 24 bits long and can be incremented only 
on every 12th cycle, overflow could occur after 274 * 12 cycles. At 50MHz, this is 
approximately every 4 seconds. To make it easier to read the statistics before they 
overflow, PaRC can produce a Statistics-Half-Full (SHF) signal. This signals goes 
high when a statistic value that is being incremented is more than 2° (i.e., more 
than half of its maximum value). It will drop back low when any one of the statistics 
values are reset. This signal is only produced when the Mask-SHF bit (bit 3 of the 
STAT-INFO register) is 0. If Mask-SHF is set to 1, SHF will always be 0. 


A.4.4 Control Port Interface 


The control port consists of 64 locations which can be read and/or written via a 


simple protocol. It is implemented using several signals: 


CAD.[5..0]: The address being operated upon. 
CR_W: The write line (active low) 
CDATA.[7..0]: Bidirectional data bus 


C_OE: Output enable (active low) for CDATA bus 


CDATA is the bidirectional bus on which data is sent to and from the control port. 
When C_OE goes high the CDATA bus will be tristated within time Tegata_-. A read 
is done whenever C_OE is low, and CAD has a stable value. If CAD is changed after 
C_OE is brought low, then after time T,caq CDATA will have the correct value. The 
value of Treaq depends on the location being read and is listed in the chart below. 
If C_OE is brought low as, or after, CAD is stabilized then the time until CDATA 
has the correct value is the maximum of T,.caq after CAD is stable, and Tedata—en after 


C_OE goes low. 


Remember that the 48 statistic locations can only be read while in Read-and- 


Clear mode. An attempt to read or clear a statistic location while statistics are being 
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collected will be ignored.t' Also, remember that the values from the error locations 
(8,9,A,B) and the info locations (7,C,D,E,F) simply reflect the current internal state 
of PaRC. As such they may change (asynchronously) while they are being read. 


Writes are done by asserting the address on CAD, the data on CDATA, and then 
pulsing CR_W low. Write pulses must last at least time T.»—min, and be separated by 
time Twp—recover- Lhe address must be set up Taddr—setup before the write pulse begins 
and held until Taddr—nota after it ends. The data must be set up Tyata—setup before, and 
held Taata—-hoid after the beginning of the write pulse. When a write is done, it will 


take 2 or 3 cycles from the beginning of the write pulse for the write to complete. 


When a write is issued, it will occur regardless of the value of C_OE. So if a write 


is being done to a command location, it is not necessary to bring C_OE high and 


drive the CDATA bus. 


‘Reading a statistic location while statistics are being collected will actually return a portion of 
that statistic location which the statistics section is currently incrementing. 
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Control Port Timing 


CDATA Timing *! 
Tedata—z 15 ns Time from C_OE rises until CDATA bus is high-Z. 
Tedata—en 15 ns Time from C_OE falls low until CDATA bus is stable. 


Read Timing *! 


Tread—entr! 30 8 Read time for Control registers (locations 0,1,2,3,4). 
Tread—error 930 ns *? Read time for Error registers (locations 8,9,A,B). 
Tread—stats 930 ns *? Read time for Info locations (locations 7,C,D,E,F). 
Tread—info 40 ns Read time for STATS locations. 


Write Timing 


Taddr—setup  1O ns Time addr should be stable before CR_W goes low. 
Taddr—hold 10 ns Time addr should be stable after CR_W goes high. 
Tdata—setup LO ns Time data should be stable before CR_W goes low. 
Tdata-hold 10 ns Time data should be stable after CR_W goes low. 
Twr—recover 4cycles Minimum time between start of write pulses. 
Twp—min lcye.t+10ns Minimum pulse width on write line. 

NOTES: 


*1 — These times assume an output loading of 35pf or less on the CDATA bus. If 
the loading is greater the delay will be longer. 


*2 — These values merely reflect the current internal state of PaRC; as such they 


may change (asynchronously) while they are being read. 


117 


Bibliography 


Arvind, Kk. Ekanadham, and D. E. Culler. The Price of Asynchronous Parallelism: 
An Analysis of Dataflow Architectures. In CONPAR 88, Manchester, England, 
1988. 


G. A. Boughton. Data Link Chip. Internal Memo, Computation Structures 
Group, Massachusetts Institute of Technology, Cambridge MA, March 1989. 


W. J. Dally. Virtual Channel Flow Control. In Proceedings of the 17th Interna- 
tional Symposium on Computer Architecture, Seattle, Washington, May 1990. 


G. L. Frazier and Y. Tamir. The Design and Implementation of a Multi-Queue 
Buffer for VLSI Communication Switches. In Proceedings of the [FEE Confer- 
ence on Computer Design: VEST in Computers and Processors, Cambridge, MA, 
October 1989. 


G. Grafe and J. E. Hoch. The Epsilon-2 Multiprocessor System. To appear in 
Journal of Parallel and Distributed Computing, December, 1990. 


C. F. Joerg. Design of a Packet Switched Routing Chip for the Dataflow Su- 
percomputer. Bachelor’s thesis, Dept. of Electrical Engineering and Computer 
Science, Massachusetts Institute of Technology, Cambridge MA, May 1987. 


C.E. Leiserson. Fat Trees: Universal Networks for Hardware-Efficient Supercom- 
puting. In [EEE Transactions on Computers, vol. C-34, No. 10. October 1985. 


R. S. Nikhil. Id Nouveau Reference Manual, Part I: Syntax. Technical Report, 
Computation Structures Group, Massachusetts Institute of Technology, Cam- 


bridge MA, April 1987. 


G. M. Papadopoulos. Implementation of a General Purpose Dataflow Multi- 
processor. PhD thesis, Dept. of Electrical Engineering and Computer Science, 
Massachusetts Institute of Technology, Cambridge MA, August 1988. 


G. M. Papadopoulos and D. E. Culler. Monsoon: An Explicit Token Store Ar- 
chitecture. In Proceedings of the 17th International Symposium on Computer 
Architecture, Seattle, Washington, May 1990. 


118 


ll 


12 


13 


14 


15 


[16] 


M. C. Pease, The Indirect Binary n-Cube Microprocessor Array. [EEE Trans. 
on Computers vol. C-32, No. 12, Dec. 1983 


C. L. Seitz. The Cosmic Cube. Communications of the ACM, vol. 28, No. 1. 
January 1985. 


Kk. M. Steele. Implementation of an [-Structure Memory Controller. Technical 


Report LCS/TR-471, MIT, January 1990. 
A. 5. Tanenbaum. Computer Networks. Prentice-Hall Inc., Englewood NJ, 1988 


Thinking Machines Corporation. Connection Machine Model CM-2 Technical 
Summary. Thinking Machines Technical Report HA87-4, Cambridge, MA, April 
1987. 


5S. G. Younis. The Clock Distribution System of the Multiprocessor Emulation 
Facility. Technical Report LCS/TR-366, MIT, June 1986. 


119 


